library(tidyverse)
## -- Attaching packages -------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.0.0     v purrr   0.2.5
## v tibble  1.4.2     v dplyr   0.7.6
## v tidyr   0.8.1     v stringr 1.3.1
## v readr   1.1.1     v forcats 0.3.0
## -- Conflicts ----------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Background and Overview

DataCamp offers interactive courses related to R programming. While some of the material is review, it is helpful to see other perspectives. DataCamp also has some interesting materials on packages that I want to learn better (ggplot2, dplyr, ggvis, etc.). This document summarizes a few key insights from:

This document is currently split across _v003, _v003_a, _v003_b, and _v003_c to keep the number of DLLs it opens below R's hard-coded maximum. This introductory section needs to be rewritten, and the contents consolidated, at a future date.

The original DataCamp_Insights_v001 and DataCamp_Insights_v002 documents have been split for this document:


Hierarchical and Mixed Effects Models

Chapter 1 - Overview and Introduction

What is a hierarchical model?

  • Hierarchical data is nested within itself, and can be analyzed using the lme4 package
    • Example of students in a classroom - may not all be independent of each other due to teacher quality, building conditions, etc.
    • Hierarchical models can help with pooling means across small sample sizes
    • Repeated measurements (test scores each year) are also a common example of data that are not truly independent
  • Hierarchical models can include nested models and multi-level models
  • Regression frameworks can include pooled information, random effects (vs. fixed effects), mixed effects, and linear mixed effects
  • Repeated sampling can be handled with repeated-measures modeling

Parts of a regression:

  • Linear regression and linear model can be used interchangeably for this course - epsilon is the error term, assumed to be normal with zero mean and constant variance
  • The linear model in R is closely related to analysis of variance (ANOVA)
    • lm (y ~ x, myData)
    • anova( lm (y ~ x, myData) )
  • The most basic regression has an intercept, a slope, a single predictor, and an error term
    • The concept can be extended to multiple regression with additional predictors
  • There are some limitations to the multiple regression approach
    • Parameter estimates can be very sensitive to other variables - Simpson’s paradox and the like
    • Need to note that the regression coefficient is “after controlling for . . .” (all the other variables)
    • Interaction terms can be important as well
  • A regression with a separate intercept for each group (and no global intercept) is specified in R as lm(y ~ x - 1)
  • The interaction term x1*x2 is shorthand for x1 + x2 + x1:x2
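The formula idioms above can be checked directly in R; a minimal sketch with simulated data (object and variable names are hypothetical):

```r
# Simulated data to illustrate the two formula idioms above
set.seed(42)
demo <- data.frame(
    grp = factor(rep(c("a", "b"), each = 10)),
    x1  = rnorm(20),
    x2  = rnorm(20)
)
demo$y <- ifelse(demo$grp == "a", 1, 5) + 2 * demo$x1 + rnorm(20)

# Removing the global intercept (- 1) makes lm() estimate one intercept per group
fitGroups <- lm(y ~ grp - 1, data = demo)
names(coef(fitGroups))   # "grpa" "grpb": one intercept per factor level

# x1*x2 is shorthand for the main effects plus the interaction
fitStar <- lm(y ~ x1*x2, data = demo)
fitLong <- lm(y ~ x1 + x2 + x1:x2, data = demo)
all.equal(coef(fitStar), coef(fitLong))   # TRUE: the two fits are identical
```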

Random effects in regression:

  • Nested relationships tend to be hierarchical in nature - students are part of classes are part of schools and the like
    • Mathematically, this is referred to as a mapping among the distributions
  • The algebraic representation is that y ~ B*x + eps, with B ~ N(mu, sigma^2)
    • library(lme4) is the best package for this in R
    • lme4::lmer(y ~ x + (1|randomGroup), data=myData)
    • lme4::lmer(y ~ x + (randomSlope|randomGroup), data=myData)

School data:

  • Application of multi-level models to school data - influence of sex, teacher training, plotting parameter estimates

Example code includes:

rawStudent <- read.csv("./RInputFiles/classroom.csv")

studentData <- rawStudent %>%
    mutate(sex=factor(sex, labels=c("male", "female")), minority=factor(minority, labels=c("no", "yes")))


# Plot the data
ggplot(data = studentData, aes(x = housepov, y = mathgain)) +
    geom_point() +
    geom_smooth(method = 'lm')

# Fit a linear model
summary( lm(mathgain ~ housepov , data = studentData))
## 
## Call:
## lm(formula = mathgain ~ housepov, data = studentData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -168.226  -22.222   -1.306   19.763  195.156 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   56.937      1.674   34.02   <2e-16 ***
## housepov       3.531      7.515    0.47    0.639    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34.63 on 1188 degrees of freedom
## Multiple R-squared:  0.0001858,  Adjusted R-squared:  -0.0006558 
## F-statistic: 0.2208 on 1 and 1188 DF,  p-value: 0.6385
# I have aggregated the data for you into two new datasets at the classroom and school levels (as a side note, if you want to learn how to aggregate data, the dplyr or data.table courses teach these skills)
# We will also compare the model outputs across all three levels of aggregation
# Note: how we aggregate the data is important
# I aggregated the data by taking the mean across the student data (in pseudo-code: mean(mathgain) by school or mean(mathgain) by classroom),
# but another reasonable method would be to aggregate by classroom first and school second

classData <- studentData %>%
    group_by(schoolid, classid) %>%
    summarize_at(vars(mathgain, mathprep, housepov, yearstea), mean, na.rm=TRUE)
str(classData)
## Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame':  312 obs. of  6 variables:
##  $ schoolid: int  1 1 2 2 2 3 3 3 3 4 ...
##  $ classid : int  160 217 197 211 307 11 137 145 228 48 ...
##  $ mathgain: num  65.7 57.4 49.5 69 68.8 ...
##  $ mathprep: num  2 3.25 2.5 2.33 2.3 3.83 2.25 3 2.17 2 ...
##  $ housepov: num  0.082 0.082 0.082 0.082 0.082 0.086 0.086 0.086 0.086 0.365 ...
##  $ yearstea: num  1 2 1 2 12.5 ...
##  - attr(*, "vars")= chr "schoolid"
##  - attr(*, "drop")= logi TRUE
schoolData <- studentData %>%
    group_by(schoolid) %>%
    summarize_at(vars(mathgain, mathprep, housepov, yearstea), mean, na.rm=TRUE)
str(schoolData)
## Classes 'tbl_df', 'tbl' and 'data.frame':    107 obs. of  5 variables:
##  $ schoolid: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ mathgain: num  59.6 65 88.9 35.2 60.2 ...
##  $ mathprep: num  2.91 2.35 2.95 2 3.75 ...
##  $ housepov: num  0.082 0.082 0.086 0.365 0.511 0.044 0.148 0.085 0.537 0.346 ...
##  $ yearstea: num  1.73 6.02 14.93 22 3 ...
# First, plot the housepov and mathgain at the classroom-level from the classData data.frame
ggplot(data = classData, aes(x = housepov, y = mathgain)) +
    geom_point() +
    geom_smooth(method = 'lm')

# Second, plot the housepov and mathgain at the school-level from the schoolData data.frame
ggplot(data = schoolData, aes(x = housepov, y = mathgain)) +
    geom_point() +
    geom_smooth(method = 'lm')

# Third, compare your linear regression results from the previous exercise to the two new models
summary( lm(mathgain ~ housepov, data = studentData)) ## student-level data
## 
## Call:
## lm(formula = mathgain ~ housepov, data = studentData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -168.226  -22.222   -1.306   19.763  195.156 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   56.937      1.674   34.02   <2e-16 ***
## housepov       3.531      7.515    0.47    0.639    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34.63 on 1188 degrees of freedom
## Multiple R-squared:  0.0001858,  Adjusted R-squared:  -0.0006558 
## F-statistic: 0.2208 on 1 and 1188 DF,  p-value: 0.6385
summary( lm(mathgain ~ housepov, data = classData)) ## class-level data
## 
## Call:
## lm(formula = mathgain ~ housepov, data = classData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -80.479 -14.444  -1.447  13.151 156.516 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   58.160      2.542  22.879   <2e-16 ***
## housepov      -3.236     10.835  -0.299    0.765    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.14 on 310 degrees of freedom
## Multiple R-squared:  0.0002876,  Adjusted R-squared:  -0.002937 
## F-statistic: 0.08918 on 1 and 310 DF,  p-value: 0.7654
summary( lm(mathgain ~ housepov, data = schoolData)) ## school-level data
## 
## Call:
## lm(formula = mathgain ~ housepov, data = schoolData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.660  -9.947  -2.494   9.546  41.445 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   59.338      2.624  22.616   <2e-16 ***
## housepov     -11.948     10.987  -1.087    0.279    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.8 on 105 degrees of freedom
## Multiple R-squared:  0.01114,    Adjusted R-squared:  0.00172 
## F-statistic: 1.183 on 1 and 105 DF,  p-value: 0.2793
# Plot the means of your data; predictor is your x-variable, response is your y-variable, and intDemo is your data.frame
intDemo <- data.frame(predictor=factor(c(rep("a", 5), rep("b", 5), rep("c", 5))), 
                      response=c(-1.207, 0.277, 1.084, -2.346, 0.429, 5.759, 4.138, 4.18, 4.153, 3.665, 9.046, 8.003, 8.447, 10.129, 11.919)
                      )
str(intDemo)
## 'data.frame':    15 obs. of  2 variables:
##  $ predictor: Factor w/ 3 levels "a","b","c": 1 1 1 1 1 2 2 2 2 2 ...
##  $ response : num  -1.207 0.277 1.084 -2.346 0.429 ...
ggIntDemo <- ggplot(intDemo, aes(x = predictor, y = response) ) +
    geom_point() +
    theme_minimal() + stat_summary(fun.y = "mean", color = "red",
                                   size = 3, geom = "point") +
    xlab("Intercept groups")
print(ggIntDemo)

# Fit a linear model to your data where response is "predicted by"(~) predictor
intModel <- lm( response ~ predictor - 1 , data = intDemo)
summary(intModel)
## 
## Call:
## lm(formula = response ~ predictor - 1, data = intDemo)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9934 -0.7842 -0.2260  0.7056  2.4102 
## 
## Coefficients:
##            Estimate Std. Error t value Pr(>|t|)    
## predictora  -0.3526     0.5794  -0.609    0.554    
## predictorb   4.3790     0.5794   7.558 6.69e-06 ***
## predictorc   9.5088     0.5794  16.412 1.38e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.296 on 12 degrees of freedom
## Multiple R-squared:  0.9646, Adjusted R-squared:  0.9557 
## F-statistic:   109 on 3 and 12 DF,  p-value: 5.696e-09
extractAndPlotResults <- function(intModel){
    intCoefPlot <- broom::tidy(intModel)
    intCoefPlot$term <- factor(gsub("predictor", "", intCoefPlot$term))

    plotOut <- ggIntDemo + geom_point(data = intCoefPlot,
                           aes(x = term, y = estimate),
                           position = position_dodge(width = 0.4),
                           color = 'blue', size = 8, alpha = 0.25)
    print(plotOut)
}


# Run the next code, which extracts the model's coefficients and plots them
extractAndPlotResults(intModel)

multIntDemo <- data.frame(group=factor(c(rep("a", 5), rep("b", 5), rep("c", 5))), 
                          x=rep(0:4, times=3), 
                          intercept=c(4.11, -1.69, 1.09, 1.9, 1.21, 4.63, 10.29, 4.67, 12.06, 4.78, 15.22, 19.15, 4.44, 8.88, 9.47), 
                          response=c(4.11, 2.31, 9.09, 13.9, 17.21, 4.63, 14.29, 12.67, 24.06, 20.78, 15.22, 23.15, 12.44, 20.88, 25.47)
                          )
str(multIntDemo)
## 'data.frame':    15 obs. of  4 variables:
##  $ group    : Factor w/ 3 levels "a","b","c": 1 1 1 1 1 2 2 2 2 2 ...
##  $ x        : int  0 1 2 3 4 0 1 2 3 4 ...
##  $ intercept: num  4.11 -1.69 1.09 1.9 1.21 ...
##  $ response : num  4.11 2.31 9.09 13.9 17.21 ...
plot_output1 <- function(out1){
    ggmultIntgDemo <- ggplot( multIntDemo, aes(x = x, y = response) ) +
        geom_point(aes(color = group)) +
        theme_minimal() +
        scale_color_manual(values = c("blue", "black", "red")) +
        stat_smooth(method = 'lm', fill = NA, color = 'orange', size = 3)
    print(ggmultIntgDemo)
}

plot_output2 <- function(out2){
    out2Tidy <- broom::tidy(out2)
    out2Tidy$term <- gsub("group", "", out2Tidy$term)
    out2Plot <- data.frame(group = out2Tidy[ -1, 1],
                           slope = out2Tidy[ 1, 2],
                           intercept = out2Tidy[ -1, 2])
    ggmultIntgDemo2 <- ggplot( multIntDemo, aes(x = x, y = response) ) +
        geom_point(aes(color = group))+
        theme_minimal() +
        scale_color_manual(values = c("blue", "black", "red")) +
        geom_abline(data = out2Plot,
                    aes(intercept = intercept, slope = slope, color = group))
    print(ggmultIntgDemo2)
}

plot_output3 <- function(out3){
    ggmultIntgDemo3 <- ggplot( multIntDemo, aes(x = x, y = response) ) +
        geom_point(aes(color = group)) +
        theme_minimal() +
        scale_color_manual(values = c("blue", "black", "red")) +
        stat_smooth(method = 'lm', aes(color = group), fill = NA)
    print(ggmultIntgDemo3)
}

# First, run a model without considering a different intercept for each group
out1 <- lm( response ~ x, data=multIntDemo )
summary(out1)
## 
## Call:
## lm(formula = response ~ x, data = multIntDemo)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.101 -4.021 -2.011  3.590 11.739 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    8.141      2.615   3.113  0.00824 **
## x              3.270      1.068   3.062  0.00908 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.848 on 13 degrees of freedom
## Multiple R-squared:  0.4191, Adjusted R-squared:  0.3744 
## F-statistic: 9.378 on 1 and 13 DF,  p-value: 0.009081
plot_output1(out1)

# Considering the same slope but different intercepts for each group
out2 <- lm( response ~ x + group - 1, data=multIntDemo )
summary(out2)
## 
## Call:
## lm(formula = response ~ x + group - 1, data = multIntDemo)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.992 -2.219 -0.234  1.810  6.988 
## 
## Coefficients:
##        Estimate Std. Error t value Pr(>|t|)    
## x        3.2697     0.7516   4.350 0.001155 ** 
## groupa   2.7847     2.3767   1.172 0.266085    
## groupb   8.7467     2.3767   3.680 0.003625 ** 
## groupc  12.8927     2.3767   5.425 0.000209 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.117 on 11 degrees of freedom
## Multiple R-squared:  0.9534, Adjusted R-squared:  0.9364 
## F-statistic: 56.23 on 4 and 11 DF,  p-value: 2.97e-07
plot_output2(out2)

# Considering a different slope and intercept for each group (i.e., an interaction)
out3 <- lm( response ~ x + group - 1 + x:group, multIntDemo)
summary(out3)
## 
## Call:
## lm(formula = response ~ x + group - 1 + x:group, data = multIntDemo)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.992 -2.429 -0.234  2.368  5.541 
## 
## Coefficients:
##          Estimate Std. Error t value Pr(>|t|)    
## x           3.779      1.308   2.888 0.017941 *  
## groupa      1.766      3.205   0.551 0.595053    
## groupb      6.872      3.205   2.144 0.060621 .  
## groupc     15.786      3.205   4.925 0.000819 ***
## x:groupb    0.428      1.851   0.231 0.822263    
## x:groupc   -1.956      1.851  -1.057 0.318050    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.138 on 9 degrees of freedom
## Multiple R-squared:  0.9615, Adjusted R-squared:  0.9358 
## F-statistic: 37.42 on 6 and 9 DF,  p-value: 7.263e-06
plot_output3(out3)

multIntDemo$intercept <- c(-0.87, 3.35, 1.25, 0.88, -1.05, 4.55, 1.22, 3.34, 1.26, 3.75, 7.71, 9.59, 2.28, 1.9, 13.35)
multIntDemo$response <- c(-0.87, 6.35, 7.25, 9.88, 10.95, 4.55, 4.22, 9.34, 10.26, 15.75, 7.71, 12.59, 8.28, 10.9, 25.35)

# Run model
outLmer <- lme4::lmer( response ~ x + ( 1 | group), multIntDemo)

# Look at model outputs 
summary( outLmer )
## Linear mixed model fit by REML ['lmerMod']
## Formula: response ~ x + (1 | group)
##    Data: multIntDemo
## 
## REML criterion at convergence: 76.9
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -1.31582 -0.61105 -0.01593  0.45125  2.19118 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  group    (Intercept)  7.98    2.825   
##  Residual             10.71    3.272   
## Number of obs: 15, groups:  group, 3
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   3.5540     2.1913   1.622
## x             2.9733     0.5975   4.977
## 
## Correlation of Fixed Effects:
##   (Intr)
## x -0.545
broom::tidy( outLmer )
##                      term estimate std.error statistic    group
## 1             (Intercept) 3.554000 2.1912887  1.621877    fixed
## 2                       x 2.973333 0.5974739  4.976507    fixed
## 3    sd_(Intercept).group 2.824834        NA        NA    group
## 4 sd_Observation.Residual 3.272500        NA        NA Residual
extractAndPlotOutput <- function(outLmer, slope=3){
    multIntDemo$lmerPredict <- predict(outLmer)
    ggmultIntgDemo2 <- ggplot( multIntDemo, aes(x = x, y = response) ) +
        geom_point(aes(color = group))+
        theme_minimal() +
        scale_color_manual(values = c("blue", "black", "red")) +
        geom_abline(data = multIntDemo,
                    aes(intercept = intercept, slope = slope, color = group))
    outPlot <-  ggmultIntgDemo2 +
                geom_line( data =  multIntDemo,
                      aes(x = x, y = lmerPredict, color = group),
                      linetype = 2)
    print(outPlot)
}


# Extract predictor variables and plot
extractAndPlotOutput(outLmer)

# Random effect slopes
multIntDemo$response <- c(-0.72, 1.5, 4.81, 6.61, 13.62, 10.21, 9.64, 11.91, 16.39, 16.97, 8.76, 14.79, 15.83, 15.27, 17.36)
multIntDemo$intercept <- c(-0.72, -1.5, -1.19, -2.39, 1.62, 10.21, 6.64, 5.91, 7.39, 4.97, 8.76, 11.79, 9.83, 6.27, 5.36)

outLmer2 <- lme4::lmer( response ~ ( x|group ), multIntDemo)
summary(outLmer2)
## Linear mixed model fit by REML ['lmerMod']
## Formula: response ~ (x | group)
##    Data: multIntDemo
## 
## REML criterion at convergence: 69.7
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -1.56747 -0.54105 -0.06286  0.75141  1.27947 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  group    (Intercept) 273.998  16.553        
##           x             6.096   2.469   -1.00
##  Residual               2.466   1.570        
## Number of obs: 15, groups:  group, 3
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   21.676      1.383   15.67
broom::tidy(outLmer2)
##                      term  estimate std.error statistic    group
## 1             (Intercept) 21.675699  1.383061  15.67226    fixed
## 2    sd_(Intercept).group 16.552871        NA        NA    group
## 3              sd_x.group  2.468959        NA        NA    group
## 4 cor_(Intercept).x.group -1.000000        NA        NA    group
## 5 sd_Observation.Residual  1.570349        NA        NA Residual
plotOutput <- function(outLmer2){
    multIntDemo$lmerPredict2 <- predict(outLmer2)
    ggmultIntgDemo3 <- ggplot( multIntDemo, aes(x = x, y = response) ) +
        geom_point(aes(color = group)) +
        theme_minimal() +
        scale_color_manual(values = c("blue", "black", "red")) +
        stat_smooth(method = 'lm', aes(color = group), fill = NA)
    plotOut <- ggmultIntgDemo3 +
            geom_line( data =  multIntDemo,
                      aes(x = x, y = lmerPredict2, color = group),
                      linetype = 2)
    print(plotOut)
}


# Extract and plot
plotOutput(outLmer2)

# Mixed effect model
lmerModel <- lme4::lmer(mathgain ~ sex + 
                  mathprep + mathknow + (1|classid) +
                  (1|schoolid), data = studentData, na.action = "na.omit",
                  REML = TRUE)
summary(lmerModel)
## Linear mixed model fit by REML ['lmerMod']
## Formula: 
## mathgain ~ sex + mathprep + mathknow + (1 | classid) + (1 | schoolid)
##    Data: studentData
## 
## REML criterion at convergence: 10677.9
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.3203 -0.6146 -0.0294  0.5467  5.5331 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  classid  (Intercept)  103.57  10.177  
##  schoolid (Intercept)   85.44   9.244  
##  Residual             1019.47  31.929  
## Number of obs: 1081, groups:  classid, 285; schoolid, 105
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   52.250      3.838  13.613
## sexfemale     -1.526      2.041  -0.747
## mathprep       2.426      1.298   1.869
## mathknow       2.405      1.299   1.851
## 
## Correlation of Fixed Effects:
##           (Intr) sexfml mthprp
## sexfemale -0.268              
## mathprep  -0.878  0.001       
## mathknow  -0.003  0.011  0.005
extractAndPlot <- function(lmerModel){
    modelOutPlot <- broom::tidy(lmerModel, conf.int = TRUE)
    modelOutPlot <- modelOutPlot[ modelOutPlot$group =="fixed" &
                               modelOutPlot$term != "(Intercept)", ]
    plotOut <- ggplot(modelOutPlot, aes(x = term, y = estimate,
                             ymin = conf.low,
                             ymax = conf.high)) +
            theme_minimal() +
            geom_hline(yintercept = 0.0, color = 'red', size = 2.0) +
            geom_point() +
            geom_linerange() + coord_flip()
    print(plotOut)
}


# Extract and plot 
extractAndPlot(lmerModel)


Chapter 2 - Linear Mixed-Effect Models

Linear mixed effect model - Birth rates data:

  • Small populations are highly sensitive to stochastic effects - if the mean is 1, a group of 5 might have 0 or 10
  • Questions about how counties may impact birth rates, over and above other demographic factors
    • Example of plotting birth rate vs. county - will see both the highest and lowest birth rates in the smallest counties
  • Random effect syntax for the lmer model includes
    • (1 | group) - random intercept with fixed mean
    • (1 | g1/g2) - intercepts vary among g1 and among g2 within g1
    • (1 | g1) + (1 | g2) - random intercepts for two variables
    • x + (x | g) - correlated random slope and intercept
    • x + (x || g) - uncorrelated random slope and intercept
    • See lme4 documentation for additional details
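As a sketch of what the nested syntax implies (group names hypothetical): (1 | g1/g2) is shorthand for (1 | g1) + (1 | g1:g2), so the model gets one intercept per g1 level plus one per g1-by-g2 combination. The grouping factors involved can be built in base R:

```r
# Hypothetical nesting: classrooms (g2) within schools (g1)
g1 <- factor(rep(c("school1", "school2"), each = 4))
g2 <- factor(rep(c("classA", "classB"), times = 4))

# (1 | g1/g2) expands to (1 | g1) + (1 | g1:g2); the second grouping
# factor is the school-by-class combination, not the bare class label
nested <- interaction(g1, g2, drop = TRUE)
nlevels(g1)       # 2 school-level intercepts
nlevels(nested)   # 4 class-within-school intercepts (vs. 2 bare class labels)
```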

Understanding and reporting the outputs of lmer:

  • The output from lmer is similar to the output from lm, but with some key differences - using print() will show
    • The method used is REML - restricted maximum likelihood - which tends to estimate the variance components better than maximum likelihood for these problems
    • There is an REML convergence criterion, which can be a helpful diagnostic
    • Can see the standard deviations for both the grouping factor (e.g., state) and the residual, along with the number of observations
    • Get the fixed-effects coefficients in a similar form as lm()
  • The summary() call on lmer produces several additional outputs
    • Residuals summary
    • Fixed effects estimates include SE and t-values (but not p-values)
    • Correlations of fixed effects
  • Can grab only the fixed effects using fixef(myLMERObject)
    • Can grab only the random effects using ranef(myLMERObject), though these will not have confidence intervals
    • The random effects confidence intervals could be estimated using bootstrapping or Bayesian methods per the author of lme4 - but actual random effects are just unobserved random variables rather than parameters
  • Can grab only the confidence intervals using confint(myLMERObject)
  • Need to be careful in reporting the results - figures vs. tables vs. in-line descriptions

Statistical inference with Maryland crime data:

  • The Maryland crime data is available on data.gov - interesting for many public and private purposes
  • The null hypothesis test can be used with lmer - frequentist approach
    • By default, lmer does not provide p-values, as there is ongoing debate about the appropriate degrees of freedom and their impact on reported results
    • Can use lmerTest package to calculate and report on the p-values
  • Can use ANOVA to look at the variability explained by one model versus another model, and the associated degrees of freedom needed
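The anova() comparison can be sketched with plain lm() fits on a built-in dataset (the same call works on nested lmer fits; note that anova() on lmer objects refits them with maximum likelihood rather than REML by default):

```r
# Compare a simpler model to a richer one with an F-test (built-in mtcars data)
fitSmall <- lm(mpg ~ wt, data = mtcars)
fitLarge <- lm(mpg ~ wt + hp, data = mtcars)

# anova() reports the extra variability explained by hp and the degrees of
# freedom it costs; a small p-value favors keeping the richer model
comparison <- anova(fitSmall, fitLarge)
print(comparison)
```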

Example code includes:

# Read in births data
rawBirths <- read.csv("./RInputFiles/countyBirthsDataUse.csv")
countyBirthsData <- rawBirths
str(countyBirthsData)
## 'data.frame':    580 obs. of  8 variables:
##  $ X                 : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Year              : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ TotalPopulation   : int  203709 115620 103057 104173 660367 156993 353089 415395 226519 119565 ...
##  $ BirthRate         : num  11.5 12.1 11.8 12.4 13.3 ...
##  $ AverageBirthWeight: num  3261 3209 3239 3207 3177 ...
##  $ AverageAgeofMother: num  27.5 26.3 25.8 26.9 27.9 ...
##  $ CountyName        : Factor w/ 472 levels "Ada","Adams",..: 22 64 141 189 200 229 248 273 278 279 ...
##  $ State             : Factor w/ 50 levels "AK","AL","AR",..: 2 2 2 2 2 2 2 2 2 2 ...
# First, build an lmer model with state as a random effect. Then look at the model's summary and the plot of residuals.
birthRateStateModel <- lme4::lmer(BirthRate ~ (1|State), data=countyBirthsData)
summary(birthRateStateModel)
## Linear mixed model fit by REML ['lmerMod']
## Formula: BirthRate ~ (1 | State)
##    Data: countyBirthsData
## 
## REML criterion at convergence: 2411
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.7957 -0.6056 -0.1063  0.5211  5.5948 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  State    (Intercept) 1.899    1.378   
##  Residual             3.256    1.804   
## Number of obs: 578, groups:  State, 50
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  12.3362     0.2216   55.67
plot(birthRateStateModel)

# Next, plot the predicted values from the model on top of the plot shown during the video.
countyBirthsData$birthPredictState <- predict(birthRateStateModel, countyBirthsData)
ggplot() + theme_minimal() +
    geom_point(data =countyBirthsData, aes(x = TotalPopulation, y = BirthRate)) + 
    geom_point(data = countyBirthsData, aes(x = TotalPopulation, y = birthPredictState),
               color = 'blue', alpha = 0.5
               )
## Warning: Removed 2 rows containing missing values (geom_point).

## Warning: Removed 2 rows containing missing values (geom_point).

# Include the AverageAgeofMother as a fixed effect within the lmer and state as a random effect
ageMotherModel <- lme4::lmer( BirthRate ~ AverageAgeofMother + (1|State), data=countyBirthsData)
summary(ageMotherModel)
## Linear mixed model fit by REML ['lmerMod']
## Formula: BirthRate ~ AverageAgeofMother + (1 | State)
##    Data: countyBirthsData
## 
## REML criterion at convergence: 2347.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.9602 -0.6086 -0.1042  0.5144  5.2686 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  State    (Intercept) 1.562    1.250   
##  Residual             2.920    1.709   
## Number of obs: 578, groups:  State, 50
## 
## Fixed effects:
##                    Estimate Std. Error t value
## (Intercept)        27.57033    1.81801  15.165
## AverageAgeofMother -0.53549    0.06349  -8.434
## 
## Correlation of Fixed Effects:
##             (Intr)
## AvrgAgfMthr -0.994
# Compare the random-effect model to the fixed-effect linear model
summary(lm(BirthRate ~ AverageAgeofMother, data = countyBirthsData))
## 
## Call:
## lm(formula = BirthRate ~ AverageAgeofMother, data = countyBirthsData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.8304 -1.3126 -0.1795  1.2198  8.7327 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        29.06637    1.83374  15.851   <2e-16 ***
## AverageAgeofMother -0.59380    0.06441  -9.219   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.065 on 576 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.1286, Adjusted R-squared:  0.1271 
## F-statistic: 84.99 on 1 and 576 DF,  p-value: < 2.2e-16
# Include the AverageAgeofMother as a correlated random-effect slope parameter
ageMotherModelRandomCorrelated <- lme4::lmer(BirthRate ~ AverageAgeofMother + (AverageAgeofMother|State),
                       countyBirthsData)
summary(ageMotherModelRandomCorrelated)
## Linear mixed model fit by REML ['lmerMod']
## Formula: BirthRate ~ AverageAgeofMother + (AverageAgeofMother | State)
##    Data: countyBirthsData
## 
## REML criterion at convergence: 2337.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8399 -0.5966 -0.1133  0.5228  5.1815 
## 
## Random effects:
##  Groups   Name               Variance Std.Dev. Corr 
##  State    (Intercept)        78.75816 8.8746        
##           AverageAgeofMother  0.08482 0.2912   -0.99
##  Residual                     2.80306 1.6742        
## Number of obs: 578, groups:  State, 50
## 
## Fixed effects:
##                    Estimate Std. Error t value
## (Intercept)        27.22042    2.41281  11.282
## AverageAgeofMother -0.52347    0.08302  -6.306
## 
## Correlation of Fixed Effects:
##             (Intr)
## AvrgAgfMthr -0.997
# Include the AverageAgeofMother as an uncorrelated random-effect slope parameter
ageMotherModelRandomUncorrelated <- lme4::lmer(BirthRate ~ AverageAgeofMother + 
                                                    (AverageAgeofMother || State), data=countyBirthsData
                                               )
summary(ageMotherModelRandomUncorrelated)
## Linear mixed model fit by REML ['lmerMod']
## Formula: 
## BirthRate ~ AverageAgeofMother + ((1 | State) + (0 + AverageAgeofMother |  
##     State))
##    Data: countyBirthsData
## 
## REML criterion at convergence: 2347.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.9602 -0.6086 -0.1042  0.5144  5.2686 
## 
## Random effects:
##  Groups   Name               Variance Std.Dev.
##  State    (Intercept)        1.562    1.250   
##  State.1  AverageAgeofMother 0.000    0.000   
##  Residual                    2.920    1.709   
## Number of obs: 578, groups:  State, 50
## 
## Fixed effects:
##                    Estimate Std. Error t value
## (Intercept)        27.57033    1.81801  15.165
## AverageAgeofMother -0.53549    0.06349  -8.434
## 
## Correlation of Fixed Effects:
##             (Intr)
## AvrgAgfMthr -0.994
out <- ageMotherModelRandomUncorrelated

# Extract the fixed-effect coefficients
lme4::fixef(out)
##        (Intercept) AverageAgeofMother 
##         27.5703303         -0.5354886
# Extract the random-effect coefficients
lme4::ranef(out)
## $State
##    (Intercept) AverageAgeofMother
## AK  1.03549149                  0
## AL -0.52500819                  0
## AR  0.48023356                  0
## AZ -1.04094123                  0
## CA  0.50530542                  0
## CO  0.09585582                  0
## CT -1.91638101                  0
## DC  0.96029532                  0
## DE -0.38938118                  0
## FL -1.87440508                  0
## GA  0.39776424                  0
## HI  0.08513474                  0
## IA  0.96279528                  0
## ID  1.17377458                  0
## IL -0.12739337                  0
## IN -0.32655206                  0
## KS  0.85650338                  0
## KY  0.64871300                  0
## LA  1.04437463                  0
## MA -1.40082047                  0
## MD  0.10842918                  0
## ME -1.63520397                  0
## MI -1.13797832                  0
## MN  0.93266233                  0
## MO  0.07081901                  0
## MS -0.21397453                  0
## MT -0.13190265                  0
## NC -0.28681241                  0
## ND  0.99847758                  0
## NE  1.49390428                  0
## NH -1.45440958                  0
## NJ -0.30089452                  0
## NM -0.69753301                  0
## NV  0.09012925                  0
## NY -0.58163335                  0
## OH -1.07390325                  0
## OK  0.77997159                  0
## OR -0.75845586                  0
## PA -1.59332743                  0
## RI -1.36395356                  0
## SC -0.59295090                  0
## SD  1.35141914                  0
## TN -0.13512968                  0
## TX  1.70872465                  0
## UT  3.66056804                  0
## VA  1.59187553                  0
## VT -0.51105276                  0
## WA  0.23008359                  0
## WI -0.51646717                  0
## WV -0.67684007                  0
# Estimate the confidence intervals 
(ciOut <- confint(out))
## Computing profile confidence intervals ...
##                         2.5 %      97.5 %
## .sig01              0.0000000  1.61243911
## .sig02              0.0000000  0.05033967
## .sigma              1.6091449  1.81592859
## (Intercept)        24.0121848 31.14668434
## AverageAgeofMother -0.6605319 -0.41123099
# Technical note: extracting regression coefficients from lmer objects is tricky (see the discussion between the lme4 and broom package authors during broom's development)
# Extract the parameter estimates and confidence intervals and manipulate the data
dataPlot <- data.frame(cbind( lme4::fixef(out), ciOut[ 4:5, ]))
rownames(dataPlot)[1] <- "Intercept"
colnames(dataPlot) <- c("mean", "l95", "u95")
dataPlot$parameter <- rownames(dataPlot)

# Print the new dataframe
print(dataPlot)
##                          mean        l95       u95          parameter
## Intercept          27.5703303 24.0121848 31.146684          Intercept
## AverageAgeofMother -0.5354886 -0.6605319 -0.411231 AverageAgeofMother
# Plot the results using ggplot2
ggplot(dataPlot, aes(x = parameter, y = mean,
                     ymin = l95, ymax = u95)) +
    geom_hline( yintercept = 0, color = 'red' ) +
    geom_linerange() + geom_point() + coord_flip() + theme_minimal()

# Read in crime data
rawCrime <- read.csv("./RInputFiles/MDCrime.csv")
MDCrime <- rawCrime
str(MDCrime)
## 'data.frame':    192 obs. of  5 variables:
##  $ X     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ County: Factor w/ 24 levels "ALLEGANY","ANNE ARUNDEL",..: 2 3 4 5 6 7 8 9 10 11 ...
##  $ Year  : int  2006 2006 2006 2006 2006 2006 2006 2006 2006 2006 ...
##  $ Crime : int  3167 10871 5713 257 149 374 490 729 181 752 ...
##  $ Year2 : int  0 0 0 0 0 0 0 0 0 0 ...
plot1 <- ggplot(data = MDCrime, aes(x = Year, y = Crime, group = County)) +
    geom_line() + theme_minimal() +
    ylab("Major crimes reported per county")
print(plot1)

plot1 + geom_smooth(method = 'lm')

# Null hypothesis testing uses p-values to decide whether a variable is "significant"
# Recently, the abuse and overuse of null hypothesis testing and p-values led the American Statistical Association to issue a statement on p-values:
# https://www.amstat.org/asa/files/pdfs/P-ValueStatement.pdf
# Because of these criticisms and other numerical challenges, Doug Bates (the creator of the lme4 package) does not include p-values in his package
# However, you may still want to estimate p-values, because they are still commonly used; several packages exist for this, including lmerTest

# Load lmerTest
# library(lmerTest)

# Fit the model with Year as both a fixed and random-effect
lme4::lmer(Crime ~ Year + (1 + Year | County) , data = MDCrime)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl =
## control$checkConv, : Model failed to converge with max|grad| = 0.338309
## (tol = 0.002, component 1)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue
##  - Rescale variables?;Model is nearly unidentifiable: large eigenvalue ratio
##  - Rescale variables?
## Linear mixed model fit by REML ['lmerMod']
## Formula: Crime ~ Year + (1 + Year | County)
##    Data: MDCrime
## REML criterion at convergence: 2891.03
## Random effects:
##  Groups   Name        Std.Dev. Corr
##  County   (Intercept) 655.2118     
##           Year          0.8322 1.00
##  Residual             328.2865     
## Number of obs: 192, groups:  County, 24
## Fixed Effects:
## (Intercept)         Year  
##   136642.97       -67.33  
## convergence code 0; 3 optimizer warnings; 0 lme4 warnings
# Fit the model with Year2 rather than Year
out <- lme4::lmer(Crime ~ Year2 + (1 + Year2 | County) , data = MDCrime)

# Examine the model's output
summary(out)
## Linear mixed model fit by REML ['lmerMod']
## Formula: Crime ~ Year2 + (1 + Year2 | County)
##    Data: MDCrime
## 
## REML criterion at convergence: 2535.7
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.8081 -0.2235 -0.0390  0.2837  3.0768 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  County   (Intercept) 7587959  2754.63       
##           Year2         16945   130.17  -0.91
##  Residual                8425    91.79       
## Number of obs: 192, groups:  County, 24
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  1577.28     562.42   2.804
## Year2         -67.33      26.73  -2.519
## 
## Correlation of Fixed Effects:
##       (Intr)
## Year2 -0.907
# Build the null model with only County as a random effect
null_model <- lme4::lmer(Crime ~ (1 | County) , data = MDCrime)

# Build the Year2 model with Year2 as a fixed and random slope and County as the random effect
year_model <- lme4::lmer(Crime ~ Year2 + (1 + Year2 | County) , data = MDCrime)

## Compare the two models using an anova
anova(null_model, year_model)
## refitting model(s) with ML (instead of REML)
## Data: MDCrime
## Models:
## null_model: Crime ~ (1 | County)
## year_model: Crime ~ Year2 + (1 + Year2 | County)
##            Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)    
## null_model  3 2954.4 2964.2 -1474.2   2948.4                             
## year_model  6 2568.9 2588.4 -1278.4   2556.9 391.52      3  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Chapter 3 - Generalized Linear Mixed-Effect Models

Crash course on GLMs - relaxing the assumption that the residuals are normally distributed:

  • Non-normal data can be transformed (e.g., with an arcsine square-root transform)
  • However, with advances in methodology, it is now possible to model the data directly using binomial and Poisson distributions
  • The basic call is glm(y ~ x, family="...") # the default is family="gaussian", which is the same as lm()
  • The Poisson distribution is frequently best for count data, such as website visitors per hour - its mean equals its variance (generally best for small counts, less than about 30; a normal approximation can be used for large counts)
  • For logistic regression, data can be entered in any of three formats:
    • Binary (y = 0 or 1) - glm(y ~ x, family="binomial")
    • Wilkinson-Rogers - glm(cbind(success, failure) ~ x, family="binomial")
    • Weighted - glm(y ~ x, weights=weights, family="binomial")
    • These methods differ primarily in their degrees of freedom (and thus deviance)
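
As a small illustration of the transformation approach mentioned above (a sketch with made-up proportions, not course data), the arcsine square-root transform stabilizes the variance of proportions so that an ordinary linear model can be fit:

```r
# Hypothetical proportions (e.g., survival rates per plot) - illustrative only
p <- c(0.05, 0.20, 0.45, 0.70, 0.90)
x <- 1:5

# Classic variance-stabilizing transform for proportions
pTrans <- asin(sqrt(p))

# Fit an ordinary linear model on the transformed scale
lm(pTrans ~ x)
```

In practice, a binomial GLM is usually preferable because it models the raw successes and failures directly, as the example code later in this chapter does.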

Binomial data - modeling data with only two outcomes:

  • The traditional method of analysis looks at the proportion of successes
  • The GLM allows the data to be modeled directly - logistic regression (logit link)
  • Binomial data can be fit using glmer(y ~ x + (1|group), family="binomial")
  • The regression coefficients can be difficult to interpret, sometimes leading to the use of odds ratios instead
    • An odds ratio of 2.0 would mean 2:1 odds for that specific group
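
A minimal sketch (toy data, not from the course) of turning a logistic regression coefficient into an odds ratio with exp():

```r
# Toy binary data: 1 success out of 4 when x = 0, 3 out of 4 when x = 1
x <- c(0, 0, 0, 0, 1, 1, 1, 1)
y <- c(0, 0, 1, 0, 1, 1, 0, 1)
fit <- glm(y ~ x, family = "binomial")

# The coefficient is on the log-odds scale; exponentiating gives the odds ratio
exp(coef(fit)["x"])  # matches the sample odds ratio (3/1) / (1/3) = 9
```

The same idea applies to a glmer fit, using lme4::fixef() to pull out the fixed-effect coefficients before exponentiating.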

Count data:

  • Examples include the number of events per hour (website hits) or counts per area (birds)
  • Count data differs from binomial data in that there is no pre-specified upper bound
  • While a Chi-squared test is often used for goodness of fit or tests of association with count data, the Poisson GLM can be a nice alternative:
    • glm(y ~ x, family="poisson")
    • glmer(y ~ x + (1|group), family="poisson")
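
A quick simulation (illustrative only; the seed is arbitrary) of the Poisson property noted above - the mean equals the variance:

```r
set.seed(1)  # arbitrary seed, for reproducibility
counts <- rpois(n = 10000, lambda = 4)
mean(counts)  # close to 4
var(counts)   # also close to 4
```

When the sample variance is much larger than the mean (overdispersion), a plain Poisson model may be a poor fit.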

Example code includes:

# In this case study, we will be working with simulated dose-response data
# The response is mortality (1) or survival (0) at the end of a study. During this exercise, we will fit a logistic regression using all three methods described in the video
# You have been given two datasets. dfLong has the data in a "long" format with each row corresponding to an observation (i.e., a 0 or 1)
# dfShort has the data in an aggregated format with each row corresponding to a treatment (e.g., 6 successes, 4 failures, number of replicates = 10, proportion = 0.6)

dfLong <- data.frame(dose=c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10), 
                     mortality=c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1)
                     )
str(dfLong)
## 'data.frame':    120 obs. of  2 variables:
##  $ dose     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ mortality: num  0 0 0 0 0 0 0 0 0 0 ...
dfShort <- dfLong %>% 
    group_by(dose) %>%
    summarize(mortality=sum(mortality), nReps=n()) %>%
    mutate(survival=nReps-mortality, mortalityP=mortality/nReps)
dfShort
## # A tibble: 6 x 5
##    dose mortality nReps survival mortalityP
##   <dbl>     <dbl> <int>    <dbl>      <dbl>
## 1  0         0       20    20.0       0    
## 2  2.00      4.00    20    16.0       0.200
## 3  4.00      8.00    20    12.0       0.400
## 4  6.00     10.0     20    10.0       0.500
## 5  8.00     11.0     20     9.00      0.550
## 6 10.0      13.0     20     7.00      0.650
# Fit a glm using data in a long format
fitLong <- glm(mortality ~ dose, data = dfLong, family = "binomial")
summary(fitLong)
## 
## Call:
## glm(formula = mortality ~ dose, family = "binomial", data = dfLong)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5916  -0.8245  -0.4737   1.0440   1.8524  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.13075    0.44532  -4.785 1.71e-06 ***
## dose         0.30663    0.06821   4.495 6.95e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 159.76  on 119  degrees of freedom
## Residual deviance: 134.71  on 118  degrees of freedom
## AIC: 138.71
## 
## Number of Fisher Scoring iterations: 3
# Fit a glm using data in a short format with two columns
fitShort <- glm( cbind(mortality , survival ) ~ dose , data = dfShort, family = "binomial")
summary(fitShort)
## 
## Call:
## glm(formula = cbind(mortality, survival) ~ dose, family = "binomial", 
##     data = dfShort)
## 
## Deviance Residuals: 
##       1        2        3        4        5        6  
## -2.1186   0.2316   1.0698   0.6495  -0.2699  -0.6634  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.13075    0.44537  -4.784 1.72e-06 ***
## dose         0.30663    0.06822   4.495 6.97e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 31.6755  on 5  degrees of freedom
## Residual deviance:  6.6214  on 4  degrees of freedom
## AIC: 27.415
## 
## Number of Fisher Scoring iterations: 4
# Fit a glm using data in a short format with weights
fitShortP <- glm( mortalityP ~ dose , data = dfShort, weights = nReps , family = "binomial")
summary(fitShortP)
## 
## Call:
## glm(formula = mortalityP ~ dose, family = "binomial", data = dfShort, 
##     weights = nReps)
## 
## Deviance Residuals: 
##       1        2        3        4        5        6  
## -2.1186   0.2316   1.0698   0.6495  -0.2699  -0.6634  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.13075    0.44537  -4.784 1.72e-06 ***
## dose         0.30663    0.06822   4.495 6.97e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 31.6755  on 5  degrees of freedom
## Residual deviance:  6.6214  on 4  degrees of freedom
## AIC: 27.415
## 
## Number of Fisher Scoring iterations: 4
y <- c(0, 1, 0, 1, 0, 1, 0, 1, 0, 2, 1, 2, 0, 1, 1, 0, 1, 5, 1, 1)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)


# Fit the linear model
summary(lm(y ~ x))
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3677 -0.6145 -0.2602  0.4297  3.4805 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  0.15263    0.50312   0.303   0.7651  
## x            0.07594    0.04200   1.808   0.0873 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.083 on 18 degrees of freedom
## Multiple R-squared:  0.1537, Adjusted R-squared:  0.1067 
## F-statistic: 3.269 on 1 and 18 DF,  p-value: 0.08733
# Fit the generalized linear model
summary(glm(y ~ x, family = "poisson"))
## 
## Call:
## glm(formula = y ~ x, family = "poisson")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6389  -0.9726  -0.3115   0.5307   2.1559  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -1.04267    0.60513  -1.723   0.0849 .
## x            0.08360    0.04256   1.964   0.0495 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 23.589  on 19  degrees of freedom
## Residual deviance: 19.462  on 18  degrees of freedom
## AIC: 52.17
## 
## Number of Fisher Scoring iterations: 5
# Often, we want to "look" at our data and trends in our data
# ggplot2 allows us to add trend lines to our data
# The default lines are created using a technique called local regression (loess)
# However, we can specify different models, including GLMs
# During this exercise, we'll see how to plot a GLM

# Plot the data using jittered points and the default stat_smooth
ggplot(data = dfLong, aes(x = dose, y = mortality)) + 
    geom_jitter(height = 0.05, width = 0.1) +
    stat_smooth(fill = 'pink', color = 'red') 
## `geom_smooth()` using method = 'loess'

# Plot the data using jittered points and the glm stat_smooth
ggplot(data = dfLong, aes(x = dose, y = mortality)) + 
    geom_jitter(height = 0.05, width = 0.1) +
    stat_smooth(method = 'glm',  method.args = list(family = "binomial"))

# library(lmerTest)

df <- data.frame(dose=rep(rep(c(0, 2, 4, 6, 8, 10), each=20), times=3), 
                 mortality=c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1), 
                 replicate=factor(rep(letters[1:3], each=120))
                 )
str(df)
## 'data.frame':    360 obs. of  3 variables:
##  $ dose     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ mortality: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ replicate: Factor w/ 3 levels "a","b","c": 1 1 1 1 1 1 1 1 1 1 ...
glmerOut <- lme4::glmer(mortality ~ dose + (1|replicate), family = 'binomial', data = df)
summary(glmerOut)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: mortality ~ dose + (1 | replicate)
##    Data: df
## 
##      AIC      BIC   logLik deviance df.resid 
##    378.1    389.8   -186.0    372.1      357 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.3484 -0.6875 -0.3031  0.6413  2.1907 
## 
## Random effects:
##  Groups    Name        Variance  Std.Dev.
##  replicate (Intercept) 6.658e-15 8.16e-08
## Number of obs: 360, groups:  replicate, 3
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.38736    0.27334  -8.734   <2e-16 ***
## dose         0.40948    0.04414   9.276   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##      (Intr)
## dose -0.884
# library(lmerTest)
# Fit the model and look at its summary 
# modelOut <- lme4::glmer( cbind(Purchases, Pass) ~ friend + ranking + (1|city), data = allData, family = 'binomial')
# summary( modelOut) 

# Compare outputs to a lmer model
# summary(lme4::lmer( Purchases/( Purchases + Pass) ~ friend + ranking + (1|city), data = allData))


# Run the code to see how to calculate odds ratios
# summary(modelOut) 
# exp(fixef(modelOut)[2])
# exp(confint(modelOut)[3, ])


# Load lmerTest
# library(lmerTest)


userGroups <- data.frame(group=factor(rep(rep(LETTERS[1:4], each=10), times=2)), 
                         webpage=factor(rep(c("old", "new"), each=40)), 
                         clicks=c(0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 1, 1, 1, 2, 0, 1, 1, 0, 3, 2, 3, 1, 2, 4, 2, 1, 0, 2, 0, 1, 2, 0, 2, 1, 1, 2, 4, 2, 8, 1, 1, 1, 2, 1, 1, 0, 0, 3, 0, 1, 4, 1, 2, 0, 1, 1, 0, 0, 3, 2, 0, 3, 1, 2, 2, 0, 2, 3, 1, 3, 2, 4, 4, 2, 1, 5, 2)
                         )
str(userGroups)
## 'data.frame':    80 obs. of  3 variables:
##  $ group  : Factor w/ 4 levels "A","B","C","D": 1 1 1 1 1 1 1 1 1 1 ...
##  $ webpage: Factor w/ 2 levels "new","old": 2 2 2 2 2 2 2 2 2 2 ...
##  $ clicks : num  0 0 0 0 0 0 2 0 0 0 ...
# Fit a Poisson glmer
summary( lme4::glmer(clicks ~ webpage + (1|group), family = 'poisson', data = userGroups))
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: poisson  ( log )
## Formula: clicks ~ webpage + (1 | group)
##    Data: userGroups
## 
##      AIC      BIC   logLik deviance df.resid 
##    255.5    262.6   -124.7    249.5       77 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.3999 -0.9104 -0.2340  0.4978  5.6126 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  group  (Intercept) 0.07093  0.2663  
## Number of obs: 80, groups:  group, 4
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   0.5524     0.1797   3.074  0.00211 **
## webpageold   -0.5155     0.1920  -2.685  0.00726 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##            (Intr)
## webpageold -0.400
# library(lmerTest)


rawIL <- read.csv("./RInputFiles/ILData.csv")
ILdata <- rawIL
str(ILdata)
## 'data.frame':    1808 obs. of  4 variables:
##  $ age   : Factor w/ 4 levels "15_19","20_24",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ county: Factor w/ 47 levels "ALEXANDER","BROWN",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ count : int  0 0 0 5 0 7 0 4 0 12 ...
# Fit a Poisson glmer (note: age goes before year in the formula)
modelOut <- lme4::glmer(count ~ age + year + (year|county), family = 'poisson', data = ILdata)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control
## $checkConv, : Model failed to converge with max|grad| = 0.00144074 (tol =
## 0.001, component 1)
summary(modelOut)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: poisson  ( log )
## Formula: count ~ age + year + (year | county)
##    Data: ILdata
## 
##      AIC      BIC   logLik deviance df.resid 
##   3215.6   3259.6  -1599.8   3199.6     1800 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.4511 -0.0151 -0.0056 -0.0022  4.0053 
## 
## Random effects:
##  Groups Name        Variance Std.Dev. Corr 
##  county (Intercept) 129.9459 11.3994       
##         year          0.0648  0.2546  -1.00
## Number of obs: 1808, groups:  county, 47
## 
## Fixed effects:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -10.76258    2.13022  -5.052 4.36e-07 ***
## age20_24     -0.04152    0.03690  -1.125    0.261    
## age25_29     -1.16262    0.05290 -21.976  < 2e-16 ***
## age30_34     -2.28278    0.08487 -26.898  < 2e-16 ***
## year          0.32708    0.25422   1.287    0.198    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##          (Intr) a20_24 a25_29 a30_34
## age20_24 -0.008                     
## age25_29 -0.006  0.341              
## age30_34 -0.004  0.213  0.148       
## year     -0.764  0.000  0.000  0.000
## convergence code: 0
## Model failed to converge with max|grad| = 0.00144074 (tol = 0.001, component 1)
# Extract out fixed effects
lme4::fixef(modelOut)
##  (Intercept)     age20_24     age25_29     age30_34         year 
## -10.76258497  -0.04151848  -1.16262225  -2.28277972   0.32708039
# Extract out random effects 
lme4::ranef(modelOut)
## $county
##            (Intercept)         year
## ALEXANDER   -0.2847724  0.006331741
## BROWN       -0.2847724  0.006331741
## CALHOUN     -0.2847724  0.006331741
## CARROLL     12.2418514 -0.260423999
## CASS        -0.2847724  0.006331741
## CLARK       12.2137668 -0.268553354
## CLAY        -0.2847724  0.006331741
## CRAWFORD    12.5037407 -0.265752695
## CUMBERLAND  -0.2847724  0.006331741
## DE WITT     12.7456078 -0.277675211
## DOUGLAS     13.0751590 -0.306329903
## EDGAR       12.3642794 -0.283606045
## EDWARDS     -0.2847724  0.006331741
## FAYETTE     12.8094530 -0.273060474
## FORD        -0.2847724  0.006331741
## GALLATIN    -0.2847724  0.006331741
## GREENE      -0.2847724  0.006331741
## HAMILTON    -0.2847724  0.006331741
## HANCOCK     12.8581265 -0.305650287
## HARDIN      -0.2847724  0.006331741
## HENDERSON   -0.2847724  0.006331741
## IROQUOIS    13.1616741 -0.311372907
## JASPER      -0.2847724  0.006331741
## JERSEY      12.9202747 -0.272284048
## JO DAVIESS  12.7409389 -0.289747791
## JOHNSON     -0.2847724  0.006331741
## LAWRENCE    12.3713561 -0.268571236
## MARSHALL    -0.2847724  0.006331741
## MASON       -0.2847724  0.006331741
## MENARD      -0.2180916  0.004849989
## MERCER      12.7534193 -0.271678572
## MOULTRIE    -0.2180916  0.004849989
## PIATT       12.5653132 -0.296687752
## PIKE        12.5310614 -0.259211299
## POPE        -0.2180916  0.004849989
## PULASKI     -0.2180916  0.004849989
## PUTNAM      -0.2180916  0.004849989
## RICHLAND    12.0350865 -0.273928951
## SCHUYLER    -0.2180916  0.004849989
## SCOTT       -0.2180916  0.004849989
## SHELBY      12.5183293 -0.283472292
## STARK       -0.2180916  0.004849989
## UNION       13.1465272 -0.308673332
## WABASH      -0.2180916  0.004849989
## WASHINGTON  -0.2180916  0.004849989
## WAYNE       12.1148896 -0.253234752
## WHITE       -0.2180916  0.004849989
# Run code to see one method for plotting the data
ggplot(data = ILdata, aes(x = year, y = count, group = county)) +
    geom_line() +
    facet_grid(age ~ . ) +
    stat_smooth( method = 'glm',
                method.args = list( family = "poisson"), se = FALSE,
                alpha = 0.5) +
    theme_minimal()


Chapter 4 - Repeated Measures

An introduction to repeated measures:

  • Sampling the same thing over time is a repeated measure, a specific case of a mixed-effects model
    • Following the same individuals through time - cohorts allow for controlling for individuality
    • The paired t-test is often used to assess repeated-measures data - t.test(x1, x2, paired=TRUE) # x1 and x2 must be the same length, and each element must come from the same individual
  • Repeated-measures ANOVA is a conceptual extension of the paired t-test - are the means constant over time?
    • anova(lmer(y ~ time + (1|individual)))
    • Can also be used with glmer()
    • Note that the degrees of freedom are still an open question - different packages calculate them differently

Sleep study:

  • Applying lmer() to the lme4 sleepstudy dataset - reaction times for 18 subjects measured repeatedly over time
    • This analysis echoes the classic "Student" sleep dataset, published under a pseudonym because Guinness at the time did not allow its employees to publish
    • H0 will be that the response does not vary with the treatment
    • Modeling will be done using a linear mixed model
  • Modeling approach - iterate through:
    • EDA
    • Simple regression
    • Model of interest
    • Extract information from the model
    • Visualize the final data

Hate in NY state?

  • Change in the rate of hate crimes over time by county - available from data.gov for 2010-2016
  • The level of technical detail in reporting should vary by audience - blend the data into a story for wider audiences, while remaining reproducible and technical for a scientific audience

Wrap up:

  • Hierarchical data, mixed-effects models, case studies
  • Start with the lme4 documentation for additional explorations and details

Example code includes:

y <- c(0.23, 2.735, -0.038, 6.327, -0.643, 1.69, -1.378, -1.228, -0.252, 2.014, -0.073, 6.101, 0.213, 3.127, -0.29, 8.395, -0.33, 2.735, 0.223, 1.301)
treat <- rep(c("before", "after"), times=10)
x <- rep(letters[1:10], each=2)

# Run a standard, non-paired t-test
t.test(y[treat == "before"], y[treat == "after"], paired = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  y[treat == "before"] and y[treat == "after"]
## t = -3.9043, df = 9.5409, p-value = 0.003215
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5.594744 -1.512256
## sample estimates:
## mean of x mean of y 
##   -0.2338    3.3197
# Run a standard, paired t-test
t.test(y[treat == "before"], y[treat == "after"], paired = TRUE)
## 
##  Paired t-test
## 
## data:  y[treat == "before"] and y[treat == "after"]
## t = -4.2235, df = 9, p-value = 0.002228
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5.456791 -1.650209
## sample estimates:
## mean of the differences 
##                 -3.5535
library(lmerTest)
## Loading required package: lme4
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
## 
##     expand
## 
## Attaching package: 'lmerTest'
## The following object is masked from 'package:lme4':
## 
##     lmer
## The following object is masked from 'package:stats':
## 
##     step
library(lme4)

# Run the paired-test like before
t.test(y[treat == "before"], y[treat == "after"], paired = TRUE)
## 
##  Paired t-test
## 
## data:  y[treat == "before"] and y[treat == "after"]
## t = -4.2235, df = 9, p-value = 0.002228
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5.456791 -1.650209
## sample estimates:
## mean of the differences 
##                 -3.5535
# Run a repeated-measures ANOVA
anova(lmer( y ~ treat + (1|x)))
## Type III Analysis of Variance Table with Satterthwaite's method
##       Sum Sq Mean Sq NumDF DenDF F value   Pr(>F)   
## treat 63.137  63.137     1     9  17.838 0.002228 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
data(sleepstudy, package="lme4")
str(sleepstudy)
## 'data.frame':    180 obs. of  3 variables:
##  $ Reaction: num  250 259 251 321 357 ...
##  $ Days    : num  0 1 2 3 4 5 6 7 8 9 ...
##  $ Subject : Factor w/ 18 levels "308","309","310",..: 1 1 1 1 1 1 1 1 1 1 ...
# Plot the data
ggplot(data = sleepstudy) +
    geom_line(aes(x = Days, y = Reaction, group = Subject)) +
    stat_smooth(aes(x = Days, y = Reaction), method = 'lm', size = 3, se = FALSE)

# Build a lm 
lm( Reaction ~ Days, data = sleepstudy)
## 
## Call:
## lm(formula = Reaction ~ Days, data = sleepstudy)
## 
## Coefficients:
## (Intercept)         Days  
##      251.41        10.47
# Build a lmer
(lmerOut <- lmer( Reaction ~ Days + (1|Subject), data = sleepstudy))
## Linear mixed model fit by REML ['lmerModLmerTest']
## Formula: Reaction ~ Days + (1 | Subject)
##    Data: sleepstudy
## REML criterion at convergence: 1786.465
## Random effects:
##  Groups   Name        Std.Dev.
##  Subject  (Intercept) 37.12   
##  Residual             30.99   
## Number of obs: 180, groups:  Subject, 18
## Fixed Effects:
## (Intercept)         Days  
##      251.41        10.47
# The lmer model you built during the previous exercise has been saved as lmerOut
# During this exercise, you will examine the effects of drug type using both an ANOVA framework and a regression framework

# Run an anova
anova(lmerOut)
## Type III Analysis of Variance Table with Satterthwaite's method
##      Sum Sq Mean Sq NumDF DenDF F value    Pr(>F)    
## Days 162703  162703     1   161   169.4 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Look at the regression coefficients
summary(lmerOut)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Reaction ~ Days + (1 | Subject)
##    Data: sleepstudy
## 
## REML criterion at convergence: 1786.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.2257 -0.5529  0.0109  0.5188  4.2506 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Subject  (Intercept) 1378.2   37.12   
##  Residual              960.5   30.99   
## Number of obs: 180, groups:  Subject, 18
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept) 251.4051     9.7467  22.8102   25.79   <2e-16 ***
## Days         10.4673     0.8042 161.0000   13.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##      (Intr)
## Days -0.371
# Read in NY hate data
rawHate <- read.csv("./RInputFiles/hateNY.csv")
hate <- rawHate
str(hate)
## 'data.frame':    233 obs. of  4 variables:
##  $ Year          : int  2010 2011 2012 2013 2014 2015 2016 2013 2010 2011 ...
##  $ County        : Factor w/ 59 levels "Albany","Allegany",..: 1 1 1 1 1 1 1 2 3 3 ...
##  $ TotalIncidents: int  13 7 5 3 3 3 3 1 22 11 ...
##  $ Year2         : int  0 1 2 3 4 5 6 3 0 1 ...
ggplot( data = hate, aes(x = Year, y = TotalIncidents, group = County)) +
    geom_line() + 
    geom_smooth(method = 'lm', se = FALSE)

# During this exercise, you will build a glmer
# Because most of the incidents are small count values, use a Poisson (R function family poisson) error term
# First, build a model using the actually year (variable Year, such as 2006, 2007, etc) - this model will fail
# Second, build a model using the rescaled year (variable Year2, such as 0, 1, 2, etc)
# This demonstrates the importance of considering where the intercept is located when building regression models
# Recall that a variable x can be both a fixed and random effect in a lmer() or glmer(): for example, lmer(y ~ x + (x | group)) demonstrates this syntax

# glmer with raw Year
glmer(TotalIncidents ~ Year + (Year|County), data = hate, family = "poisson")
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl =
## control$checkConv, : Model failed to converge with max|grad| = 0.370207
## (tol = 0.001, component 1)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue
##  - Rescale variables?;Model is nearly unidentifiable: large eigenvalue ratio
##  - Rescale variables?
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: poisson  ( log )
## Formula: TotalIncidents ~ Year + (Year | County)
##    Data: hate
##       AIC       BIC    logLik  deviance  df.resid 
## 1165.2746 1182.5298 -577.6373 1155.2746       228 
## Random effects:
##  Groups Name        Std.Dev. Corr 
##  County (Intercept) 217.8915      
##         Year          0.1084 -1.00
## Number of obs: 233, groups:  County, 59
## Fixed Effects:
## (Intercept)         Year  
##    295.4814      -0.1464  
## convergence code 0; 3 optimizer warnings; 0 lme4 warnings
# glmer with scaled Year, Year2
glmerOut <- glmer(TotalIncidents ~ Year2 + (Year2|County), data = hate, family = "poisson")
summary(glmerOut)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: poisson  ( log )
## Formula: TotalIncidents ~ Year2 + (Year2 | County)
##    Data: hate
## 
##      AIC      BIC   logLik deviance df.resid 
##   1165.3   1182.5   -577.6   1155.3      228 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.5434 -0.4864 -0.1562  0.3319  3.1939 
## 
## Random effects:
##  Groups Name        Variance Std.Dev. Corr
##  County (Intercept) 1.16291  1.0784       
##         Year2       0.01175  0.1084   0.02
## Number of obs: 233, groups:  County, 59
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.27952    0.16600   7.708 1.28e-14 ***
## Year2       -0.14622    0.03324  -4.398 1.09e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##       (Intr)
## Year2 -0.338
# Extract and manipulate data
countyTrend <- ranef(glmerOut)$County
countyTrend$county <- factor(row.names(countyTrend), levels =row.names(countyTrend)[order(countyTrend$Year2)])

# Plot results 
ggplot(data = countyTrend, aes(x = county, y = Year2)) + geom_point() +
    coord_flip() + 
    ylab("Change in hate crimes per year")  +
    xlab("County")


Forecasting Product Demand in R

Chapter 1 - Forecasting Demand with Time Series

Loading data into an xts object:

  • The xts object will be the building block for the course - extensible time series (xts) is an extension of the zoo package - basically, a time index attached to a data matrix
  • Can create dates using dates=seq(as.Date("YYYY-MM-DD"), length=, by="weeks") # to create weekly data (note that as.Date() expects "YYYY-MM-DD" format by default)
    • xts(myData, order.by=dates) # will create an xts object using dates as the index
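The bullets above can be sketched with toy data (the object names `myData`, `dates`, and `myXts` are illustrative, not from the course):

```r
# A minimal sketch using the xts package: a weekly time index attached to a matrix
library(xts)

myData <- matrix(rnorm(10), ncol = 1)                           # 10 weekly observations
dates  <- seq(as.Date("2014-01-19"), length = 10, by = "weeks") # weekly date sequence
myXts  <- xts(myData, order.by = dates)                         # dates become the index
head(myXts, 3)
```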

ARIMA Time Series 101:

  • AR - AutoRegressive (lags help to determine today’s values - “long memory models”)
  • MA - Moving Average (shocks/errors help to determine today’s shocks/errors - “short memory models” due to dissipation)
  • I - Integrated (does the data have a dependency across time, and how long does it last) - make the time series stationary
    • Stationarity means the statistical properties of the series do not change over time - shocks dissipate, so today has more impact on tomorrow than on distant time periods
    • Differencing the data (monthly, seasonal, etc.) can be a useful approach for making non-stationary data stationary
  • Begin by creating a training dataset and a validation dataset
  • The auto.arima() function tries to estimate the best ARIMA for a given data series
    • ARIMA(p, d, q) is ARIMA(AR, Differencing, MA)
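As a small illustration (simulated data, not from the course), auto.arima() can often recover the order of a known process:

```r
# Sketch: let auto.arima() search for the best (p, d, q) on a simulated AR(1)
library(forecast)

set.seed(123)
y <- arima.sim(model = list(ar = 0.7), n = 200)  # stationary AR(1), phi = 0.7
fit <- auto.arima(y)
fit  # order is chosen by AICc; ideally close to ARIMA(1,0,0)
```

On short series the selected order can differ from the simulated one, since the search balances fit against model complexity.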

Forecasting and Evaluating:

  • Can use the ARIMA model to forecast the data forward - extrapolating the signal (forecasting) and estimating the amount of noise (error or CI)
  • The forecast() function in R simplifies the process - forecast(myModel, h=) which will forecast forward h time periods
  • Two common error measurements include MAE (mean absolute error) and MAPE (mean absolute percentage error)
    • MAPE is better at putting things on a common scale across series of different magnitudes
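A minimal sketch of the two error measures, using toy vectors and hypothetical helper names:

```r
# Sketch: MAE (mean absolute error) and MAPE (mean absolute percentage error)
actual    <- c(100, 120, 140)
predicted <- c(110, 115, 150)

mae  <- function(a, f) mean(abs(f - a))
mape <- function(a, f) 100 * mean(abs(f - a) / a)

mae(actual, predicted)   # (10 + 5 + 10) / 3 = 8.33...
mape(actual, predicted)  # average of 10%, 4.17%, 7.14% = 7.10...
```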

Example code includes:

# Read in beverages data
rawBev <- read.csv("./RInputFiles/Bev.csv")
bev <- rawBev
str(bev)
## 'data.frame':    176 obs. of  14 variables:
##  $ M.hi.p  : num  59.2 56.3 56.3 49.3 61.3 ...
##  $ M.lo.p  : num  29.2 26.3 26.2 26.1 25.9 ...
##  $ MET.hi.p: num  63.7 60.3 60.8 55.1 65.1 ...
##  $ MET.lo.p: num  26 25.5 25.7 26.5 25.7 ...
##  $ MET.sp.p: num  50.1 48.8 48.6 47.7 50.8 ...
##  $ SEC.hi.p: num  58.6 54.6 57.9 49.7 63.7 ...
##  $ SEC.lo.p: num  29.2 26.3 26.2 26.1 25.9 ...
##  $ M.hi    : int  458 477 539 687 389 399 392 417 568 583 ...
##  $ M.lo    : int  1455 1756 2296 3240 2252 1901 1939 1999 1798 1558 ...
##  $ MET.hi  : int  2037 1700 1747 2371 1741 2072 2353 2909 3204 2395 ...
##  $ MET.lo  : int  3437 3436 3304 3864 3406 3418 3553 3376 3233 3262 ...
##  $ MET.sp  : int  468 464 490 657 439 453 423 408 501 481 ...
##  $ SEC.hi  : int  156 151 178 217 141 149 134 148 195 170 ...
##  $ SEC.lo  : int  544 624 611 646 624 610 623 599 551 539 ...
# Load xts package 
library(xts)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
library(forecast)
## 
## Attaching package: 'forecast'
## The following object is masked from 'package:ggplot2':
## 
##     autolayer
# Create the dates object as an index for your xts object
dates <- seq(as.Date("2014-01-19"), length = 176, by = "weeks")

# Create an xts object called bev_xts
bev_xts <- xts(bev, order.by = dates)


# Create the individual region sales as their own objects
MET_hi <- bev_xts[,"MET.hi"]
MET_lo <- bev_xts[,"MET.lo"]
MET_sp <- bev_xts[,"MET.sp"]

# Sum the region sales together
MET_t <- MET_hi + MET_lo + MET_sp

# Plot the metropolitan region total sales
plot(MET_t)

# Split the data into training and validation
MET_t_train <- MET_t[index(MET_t) < "2017-01-01"]
MET_t_valid <- MET_t[index(MET_t) >= "2017-01-01"]

# Use auto.arima() function for metropolitan sales
MET_t_model <- auto.arima(MET_t_train)


# Forecast the first 22 weeks of 2017
forecast_MET_t <- forecast(MET_t_model, h = 22)

# Plot this forecast #
plot(forecast_MET_t)

# Convert to numeric for ease
for_MET_t <- as.numeric(forecast_MET_t$mean)
v_MET_t <- as.numeric(MET_t_valid)

# Calculate the MAE
MAE <- mean(abs(for_MET_t - v_MET_t))

# Calculate the MAPE
MAPE <- 100*mean(abs(for_MET_t - v_MET_t)/v_MET_t)

# Print to see how good your forecast is!
print(MAE)
## [1] 898.8403
print(MAPE)
## [1] 17.10332
# Convert your forecast to an xts object
for_dates <- seq(as.Date("2017-01-01"), length = 22, by = "weeks")
for_MET_t_xts <- xts(forecast_MET_t$mean, order.by = for_dates)

# Plot the validation data set
plot(for_MET_t_xts, main = 'Forecast Comparison', ylim = c(4000, 8500))

# Overlay the forecast of 2017
lines(MET_t_valid, col = "blue")

# Plot the validation data set
plot(MET_t_valid, main = 'Forecast Comparison', ylim = c(4000, 8500))

# Overlay the forecast of 2017
lines(for_MET_t_xts, col = "blue")

# Convert the limits to xts objects
lower <- xts(forecast_MET_t$lower[, 2], order.by = for_dates)
upper <- xts(forecast_MET_t$upper[, 2], order.by = for_dates)

# Adding confidence intervals of forecast to plot
lines(lower, col = "blue", lty = "dashed")

lines(upper, col = "blue", lty = "dashed")


Chapter 2 - Components of Demand

Price elasticity:

  • Price is one of the obvious factors that impacts demand, with the relationship called price elasticity (% dDemand / % dPrice)
    • Elastic goods have elasticity > 1 (in absolute value), meaning demand changes more quickly (percentage-wise) than price
    • Inelastic goods have elasticity < 1 (in absolute value), for example gasoline
    • Unit elastic goods have elasticity = 1, meaning that X% increase in price drives X% decrease in demand
    • Linear regression can be employed to estimate the elasticity for a given product - the log-log transform helps get the % vs % coefficients
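A sketch of the log-log idea on simulated data (the variable names and the true elasticity of -1.5 are invented for illustration):

```r
# Sketch: in a log-log regression, the slope on log(price) estimates elasticity
set.seed(42)
price  <- runif(100, min = 1, max = 10)
demand <- exp(5 - 1.5 * log(price) + rnorm(100, sd = 0.1))  # true elasticity -1.5

fit <- lm(log(demand) ~ log(price))
coef(fit)["log(price)"]  # % change in demand per 1% change in price; near -1.5
```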

Seasonal/holiday/promotional effects:

  • Seasonal products are common - can be bought any time of the year, though certain seasons have higher demand (holidays are a common example)
  • Promotions are attempts by companies to influence demand
  • Linear regression can help determine relationships between demand and many other factors
    • If an xts vector has been created for key dates, can merge(train, holiday, fill=0) and the holiday column will be 0 wherever there is no match to holiday

Forecasting with regression:

  • Forecasting with time series is straightforward due to the lagged nature of the models - today's value feeds tomorrow's forecast, which in turn feeds the next day's, and so on
  • Forecasting with regression can be more tricky, particularly since we need the future inputs (such as price) in order to predict the future demand
    • Even when there are contractually fixed prices, promotions can effectively create a de facto price change anyway
  • Need to have the same column names in the test/validation dataset as were used in the modeling
    • Then, can use predict(myModel, myData)
    • May need to exponentiate in case the data are currently on the log scale rather than the absolute scale
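A sketch of scoring with predict() and back-transforming from the log scale (toy data; the column names are hypothetical):

```r
# Sketch: predict() needs the same column names as the model; exp() undoes the log
train <- data.frame(log_price = log(c(2, 3, 4, 5)),
                    log_sales = log(c(200, 150, 120, 100)))
fit <- lm(log_sales ~ log_price, data = train)

newdata <- data.frame(log_price = log(c(2.5, 4.5)))  # same column name as training
exp(predict(fit, newdata))                           # back on the absolute scale
```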

Example code includes:

bev_xts_train <- bev_xts[index(bev_xts) < "2017-01-01"]
bev_xts_valid <- bev_xts[index(bev_xts) >= "2017-01-01"]

# Save the prices of each product
l_MET_hi_p <- log(as.vector(bev_xts_train[, "MET.hi.p"]))

# Save as a data frame
MET_hi_train <- data.frame(as.vector(log(MET_hi[index(MET_hi) < "2017-01-01"])), l_MET_hi_p)
colnames(MET_hi_train) <- c("log_sales", "log_price")

# Calculate the regression
model_MET_hi <- lm(log_sales ~ log_price, data = MET_hi_train)


# Plot the product's sales
plot(MET_hi)

# Plot the product's price
MET_hi_p <- bev_xts_train[, "MET.hi.p"]
plot(MET_hi_p)

# Create date indices for New Year's week
n.dates <- as.Date(c("2014-12-28", "2015-12-27", "2016-12-25"))

# Create xts objects for New Year's
newyear <- as.xts(rep(1, 3), order.by = n.dates)

# Create sequence of dates for merging
dates_train <- seq(as.Date("2014-01-19"), length = 154, by = "weeks")

# Merge training dates into New Year's object
newyear <- merge(newyear, dates_train, fill = 0)


# Add newyear variable to your data frame
MET_hi_train <- data.frame(MET_hi_train, newyear=as.vector(newyear))

# Build regressions for the product
model_MET_hi_full <- lm(log_sales ~ log_price + newyear, data = MET_hi_train)


# Subset the validation prices #
l_MET_hi_p_valid <- log(as.vector(bev_xts_valid[, "MET.hi.p"]))

# Create a validation data frame #
MET_hi_valid <- data.frame(l_MET_hi_p_valid)
colnames(MET_hi_valid) <- "log_price"


# Predict the log of sales for your high end product
pred_MET_hi <- predict(model_MET_hi, MET_hi_valid)

# Convert predictions out of log scale
pred_MET_hi <- exp(pred_MET_hi)


# Convert to an xts object
dates_valid <- seq(as.Date("2017-01-01"), length = 22, by = "weeks")
pred_MET_hi_xts <- xts(pred_MET_hi, order.by = dates_valid)

# Plot the forecast
plot(pred_MET_hi_xts)

# Calculate and print the MAPE
MET_hi_v <- bev_xts_valid[,"MET.hi"]

MAPE <- 100*mean(abs((pred_MET_hi_xts - MET_hi_v)/MET_hi_v))
print(MAPE)
## [1] 29.57455

Chapter 3 - Blending Regression with Time Series

Residuals from regression model:

  • The residuals from the regression models can be used for further modeling - see if the residuals are related over time, and model them with time series if so
  • Need to start by gathering the residuals and then converting them to an XTS object - explore for patterns in this XTS object

Forecasting residuals:

  • When the residuals are related across time, we can use time series to model the residuals - basically, patterns to the errors provide an opportunity for further modeling
  • Can use auto.arima() on the residuals data, to see what the best ARIMA model for the residuals is
    • Can then forecast the residuals into the future using forecast(myModel, h=) # h being the number of time periods to predict forward

Transfer functions and ensembling:

  • Techniques for combining forecasts - single model (transfer function) or averaging of models (ensembling)
  • Demand can be based on both regression (modeling external factors) and time series (residuals)
  • Ensembling is a combination (blend) of the forecasts, with simple averaging being the simplest approach
    • Basically, build a stand-alone time series model and a stand-alone regression model
    • The ensemble forecast can be better or worse than any of the stand-alone models

Example code includes:

# Calculate the residuals from the model
MET_hi_full_res <- resid(model_MET_hi_full)

# Convert the residuals to an xts object
MET_hi_full_res <- xts(MET_hi_full_res, order.by = dates_train)


# Plot the histogram of the residuals
hist(MET_hi_full_res)

# Plot the residuals over time
plot(MET_hi_full_res)

# Build an ARIMA model on the residuals: MET_hi_arima
MET_hi_arima <- auto.arima(MET_hi_full_res)

# Look at a summary of the model
summary(MET_hi_arima)
## Series: MET_hi_full_res 
## ARIMA(2,0,2) with zero mean 
## 
## Coefficients:
##          ar1      ar2      ma1     ma2
##       1.5736  -0.7833  -0.8149  0.2865
## s.e.  0.0992   0.0758   0.1266  0.0941
## 
## sigma^2 estimated as 0.03921:  log likelihood=32.19
## AIC=-54.37   AICc=-53.97   BIC=-39.19
## 
## Training set error measures:
##                         ME      RMSE      MAE       MPE     MAPE      MASE
## Training set -0.0004334041 0.1954382 0.145018 -47.18396 223.0022 0.5227844
##                     ACF1
## Training set -0.01034041
# Forecast 22 weeks with your model: for_MET_hi_arima
for_MET_hi_arima <- forecast(MET_hi_arima, h=22)

# Print first 10 observations
head(for_MET_hi_arima$mean, n = 10)
## Time Series:
## Start = 1079 
## End = 1142 
## Frequency = 0.142857142857143 
##  [1] -0.07662326 -0.10617141 -0.10705342 -0.08529747 -0.05037188
##  [6] -0.01245420  0.01985656  0.04100076  0.04896519  0.04493645
# Convert your forecasts into an xts object
dates_valid <- seq(as.Date("2017-01-01"), length = 22, by = "weeks")
for_MET_hi_arima <- xts(for_MET_hi_arima$mean, order.by = dates_valid)

# Plot the forecast
plot(for_MET_hi_arima)

# Convert your residual forecast to the exponential version
for_MET_hi_arima <- exp(for_MET_hi_arima)

# Multiply your forecasts together!
for_MET_hi_final <- for_MET_hi_arima * pred_MET_hi_xts


# Plot the final forecast - don't touch the options!
plot(for_MET_hi_final, ylim = c(1000, 4300))

# Overlay the validation data set
lines(MET_hi_v, col = "blue")

# Calculate the MAE
MAE <- mean(abs(for_MET_hi_final - MET_hi_v))
print(MAE)
## [1] 481.6678
# Calculate the MAPE
MAPE <- 100 * mean(abs(for_MET_hi_final - MET_hi_v)/MET_hi_v)
print(MAPE)
## [1] 28.82836
# Build an ARIMA model using the auto.arima function
MET_hi_model_arima <- auto.arima(MET_hi)

# Forecast the ARIMA model
for_MET_hi <- forecast(MET_hi_model_arima, h = length(MET_hi_v))

# Save the forecast as an xts object
dates_valid <- seq(as.Date("2017-01-01"), length = 22, by = "weeks")
for_MET_hi_xts <- xts(for_MET_hi$mean, order.by = dates_valid)

# Calculate the MAPE of the forecast
MAPE <- 100 * mean(abs(for_MET_hi_xts - MET_hi_v) / MET_hi_v)
print(MAPE)
## [1] 36.95411
# Ensemble the two forecasts together
for_MET_hi_en <- 0.5 * (for_MET_hi_xts + pred_MET_hi_xts)

# Calculate the MAE and MAPE
MAE <- mean(abs(for_MET_hi_en - MET_hi_v))
print(MAE)
## [1] 533.8911
MAPE <- 100 * mean(abs(for_MET_hi_en - MET_hi_v) / MET_hi_v)
print(MAPE)
## [1] 32.28549

Chapter 4 - Hierarchical Forecasting

Bottom-Up Hierarchical Forecasting:

  • The hierarchical data structuring can be an advantage in forecasting, provided that the data has a natural hierarchy
  • The sum of all the lower-level forecasts should equal the higher-level forecasts
    • Bottom-up: Forecast at the lowest level and aggregate upwards (easiest but requires the largest number of forecasts)
    • Top-down: Forecast at the top level and then apportion downwards
    • Middle-out: Forecast at the middle levels and then apply both upwards and downwards

Top-Down Hierarchical Forecasting:

  • The top-down forecasting process is typically quicker but less accurate than the bottom-up forecasting process
  • Two techniques are available for top-down reconciliation
    • Average of historical proportions - the mean of the period-by-period (e.g. weekly) shares that each component contributes to the total
    • Proportion of historical averages - each component's historical average as a share of the total's historical average
  • Reconciled forecasts at lower levels are typically less accurate than the direct forecast of the lower levels
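The distinction between the two reconciliation methods can be seen with a toy two-series example (invented numbers):

```r
# Sketch: average of historical proportions vs proportion of historical averages
child <- c(40, 60, 55, 45)     # one product's weekly sales
total <- c(100, 110, 100, 90)  # the regional total each week

prop_ahp <- mean(child / total)        # mean of the week-by-week shares
prop_pha <- mean(child) / mean(total)  # ratio of the overall means

c(prop_ahp, prop_pha)  # 0.4989 vs 0.5 - the two methods generally differ
```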

Middle-Out Hierarchical Forecasting:

  • Bottom-up forecasting is higher quality but more time-consuming than top-down forecasting
  • The middle-out forecasting method is a sometimes successful blend of the methods, getting decent accuracy at a lesser time commitment

Wrap up:

  • Using time series to forecast demand forward
  • Incorporating external factors using linear regression
  • Blending time series and regression approaches
  • Top-down, bottom-up, middle-out approaches to aggregation and forecasting at various levels (hierarchical)
  • Can extend by looking at cross-elasticities (impact of competitor pricing)
  • Can better forecast proportions using time series analysis
  • Additional demand forecasting models include neural networks, exponential smoothing, etc.

Example code includes:

# Build a time series model 
MET_sp_model_arima <- auto.arima(MET_sp)

# Forecast the time series model for 22 periods
for_MET_sp <- forecast(MET_sp_model_arima, h=22)

# Create an xts object
for_MET_sp_xts <- xts(for_MET_sp$mean, order.by=dates_valid)

MET_sp_v <- MET_sp["2017"]

# Calculate the MAPE
MAPE <- 100 * mean(abs(for_MET_sp_xts - MET_sp_v) / MET_sp_v)
print(MAPE)
## [1] 6.703272
MET_sp_train <- bev_xts_train %>%
    transform(log_sales = log(MET.sp), log_price=log(MET.sp.p))
MET_sp_train <- MET_sp_train[, c("log_sales", "log_price")]
MET_sp_train$newyear <- 0
MET_sp_train$valentine <- 0
MET_sp_train$christmas <- 0
MET_sp_train$mother <- 0

MET_sp_train[index(MET_sp_train) %in% as.Date(c("2014-12-28", "2015-12-27", "2016-12-25")), "newyear"] <- 1
MET_sp_train[index(MET_sp_train) %in% as.Date(c("2014-02-09", "2015-02-08", "2016-02-07")), "valentine"] <- 1
MET_sp_train[index(MET_sp_train) %in% as.Date(c("2014-12-21", "2015-12-20", "2016-12-18")), "christmas"] <- 1
MET_sp_train[index(MET_sp_train) %in% as.Date(c("2014-05-04", "2015-05-03", "2016-05-01")), "mother"] <- 1


# Build a regression model
model_MET_sp <- lm(log_sales ~ log_price + newyear + valentine + christmas + mother, data = MET_sp_train)


MET_sp_valid <- as.data.frame(bev_xts_valid) %>%
    mutate(log_sales = log(MET.sp), log_price=log(MET.sp.p)) %>%
    select("log_sales", "log_price")
MET_sp_valid$newyear <- 0
MET_sp_valid$valentine <- 0
MET_sp_valid$christmas <- 0
MET_sp_valid$mother <- 0  

MET_sp_valid[7, "valentine"] <- 1
MET_sp_valid[19, "mother"] <- 1
MET_sp_valid$log_sales <- NULL


# Forecast the regression model using the predict function 
pred_MET_sp <- predict(model_MET_sp, MET_sp_valid)

# Exponentiate your predictions and create an xts object
pred_MET_sp <- exp(pred_MET_sp)
pred_MET_sp_xts <- xts(pred_MET_sp, order.by = dates_valid)

# Calculate MAPE
MAPE <- 100*mean(abs((pred_MET_sp_xts - MET_sp_v)/MET_sp_v))
print(MAPE)
## [1] 6.55473
# Ensemble the two forecasts
for_MET_sp_en <- 0.5 * (for_MET_sp_xts + pred_MET_sp_xts)

# Calculate the MAPE
MAPE <- 100 * mean(abs(for_MET_sp_en - MET_sp_v) / MET_sp_v)
print(MAPE)
## [1] 6.048594
# Copy over pred_MET_lo_xts
pred_MET_lo_xts <- xts(c(2960.6, 2974.1, 2943.2, 2948.6, 2915.6, 2736.4, 2953.9, 3199.4, 2934, 2898.7, 3027.7, 3165.9, 3073.1, 2842.7, 2928.7, 3070.2, 2982.2, 3018, 3031.9, 2879.4, 2993.2, 2974.1), order.by=dates_valid)


# Calculate the metropolitan regional sales forecast
for_MET_total <- pred_MET_hi_xts + for_MET_sp_en + pred_MET_lo_xts

# Calculate a validation data set 
MET_t_v <- bev_xts_valid[,"MET.hi"] + bev_xts_valid[,"MET.lo"] + bev_xts_valid[,"MET.sp"]

# Calculate the MAPE
MAPE <- 100 * mean(abs(for_MET_total - MET_t_v) / MET_t_v)
print(MAPE)
## [1] 10.61952
# Create the MET_total data
MET_total <- xts(data.frame(MET.hi=c(5942, 5600, 5541, 6892, 5586, 5943, 6329, 6693, 6938, 6138, 6361, 6378, 5423, 5097, 4937, 5496, 6870, 6626, 6356, 5657, 6577, 7202, 7381, 7404, 7204, 6667, 6153, 6035, 5633, 5283, 5178, 4758, 5058, 5254, 5954, 6166, 6247, 6304, 7202, 6662, 6814, 6174, 5412, 5380, 5674, 6472, 6912, 7404, 8614, 8849, 7174, 6489, 7174, 6555, 6402, 7671, 5012, 4790, 5075, 5238, 5615, 6113, 7706, 7811, 7898, 7232, 6585, 5870, 7084, 5125, 5330, 5553, 6349, 6195, 6271, 5851, 5333, 5854, 5609, 5649, 6051, 6409, 5786, 5190, 5085, 4949, 5151, 5147, 5426, 5509, 6956, 7870, 8224, 6685, 6153, 5802, 5244, 5162, 5036, 5025, 8378, 8944, 7109, 7605, 7846, 7598, 8012, 9551, 6102, 5366, 4932, 4962, 5392, 6194, 7239, 7621, 7460, 7097, 6596, 5848, 8306, 5344, 5848, 6341, 7364, 7269, 7053, 6682, 6971, 7521, 7063, 6298, 6003, 5227, 5047, 4877, 4851, 4628, 4516, 4442, 4935, 5181, 5431, 5866, 5919, 5704, 5957, 6019, 5962, 6021, 5880, 5674, 7439, 7415)),
                 order.by=dates_train
                 )

# Build a regional time series model
MET_t_model_arima <- auto.arima(MET_total)

# Calculate a 2017 forecast for 22 periods
for_MET_t <- forecast(MET_t_model_arima, h=22)

# Make an xts object from your forecast
for_MET_t_xts <- xts(for_MET_t$mean, order.by=dates_valid)

# Calculate the MAPE
MAPE <- 100 * mean(abs(for_MET_t_xts - MET_t_v) / MET_t_v)
print(MAPE)
## [1] 17.10332
# Calculate the average historical proportions
prop_hi <- mean(MET_hi/MET_total)
prop_lo <- mean(MET_lo/MET_total)
prop_sp <- mean(MET_sp/MET_total)

# Distribute out your forecast to each product
for_prop_hi <- prop_hi*for_MET_t_xts
for_prop_lo <- prop_lo*for_MET_t_xts
for_prop_sp <- prop_sp*for_MET_t_xts

# Calculate the MAPE's for each product
MAPE_hi <- 100 * mean(abs(for_prop_hi - MET_hi_v) / MET_hi_v)
print(MAPE_hi)
## [1] 38.7318
MET_lo_v <- bev_xts_valid[,"MET.lo"]
MAPE_lo <- 100 * mean(abs(for_prop_lo - MET_lo_v) / MET_lo_v)
print(MAPE_lo)
## [1] 10.70649
MAPE_sp <- 100 * mean(abs(for_prop_sp - MET_sp_v) / MET_sp_v)
print(MAPE_sp)
## [1] 6.232888
# Calculate the average historical proportions
prop_hi_2 <- mean(MET_hi) / mean(MET_total)
prop_lo_2 <- mean(MET_lo) / mean(MET_total)
prop_sp_2 <- mean(MET_sp) / mean(MET_total)

# Distribute out your forecast to each product
for_prop_hi_2 <- prop_hi_2 * for_MET_t_xts
for_prop_lo_2 <- prop_lo_2 * for_MET_t_xts
for_prop_sp_2 <- prop_sp_2 * for_MET_t_xts

# Calculate the MAPE's for each product
MAPE_hi <- 100 * mean(abs(for_prop_hi_2 - MET_hi_v) / MET_hi_v)
print(MAPE_hi)
## [1] 38.33559
MAPE_lo <- 100 * mean(abs(for_prop_lo_2 - MET_lo_v) / MET_lo_v)
print(MAPE_lo)
## [1] 8.450784
MAPE_sp <- 100 * mean(abs(for_prop_sp_2 - MET_sp_v) / MET_sp_v)
print(MAPE_sp)
## [1] 6.517045
SEC_total <- xts(data.frame(SEC.hi=c(700, 775, 789, 863, 765, 759, 757, 747, 746, 709, 749, 786, 796, 726, 727, 723, 778, 755, 739, 740, 723, 695, 727, 707, 725, 684, 667, 698, 727, 722, 748, 695, 742, 739, 715, 724, 686, 671, 688, 682, 710, 700, 672, 680, 695, 780, 751, 693, 809, 881, 703, 712, 768, 796, 808, 904, 641, 662, 693, 725, 719, 736, 715, 722, 732, 745, 689, 705, 811, 739, 744, 700, 745, 735, 732, 722, 721, 732, 750, 714, 752, 677, 731, 674, 720, 675, 741, 722, 715, 719, 649, 697, 743, 733, 772, 698, 690, 734, 713, 644, 788, 833, 749, 731, 670, 675, 675, 993, 773, 751, 697, 677, 750, 723, 780, 763, 721, 701, 704, 684, 985, 791, 731, 714, 704, 694, 685, 652, 708, 754, 747, 705, 711, 699, 712, 745, 706, 665, 666, 692, 676, 696, 689, 697, 689, 717, 697, 708, 660, 707, 715, 680, 922, 888)), order.by=dates_train
                 )

# Build a time series model for the region
SEC_t_model_arima <- auto.arima(SEC_total)

# Forecast the time series model
for_SEC_t <- forecast(SEC_t_model_arima, h=22)

# Make into an xts object
for_SEC_t_xts <- xts(for_SEC_t$mean, order.by=dates_valid)

SEC_t_v <- bev_xts_valid$SEC.hi + bev_xts_valid$SEC.lo
# Calculate the MAPE
MAPE <- 100 * mean(abs(for_SEC_t_xts - SEC_t_v) / SEC_t_v)
print(MAPE)
## [1] 4.742324
SEC_hi <- bev_xts_train[, "SEC.hi"]
SEC_lo <- bev_xts_train[, "SEC.lo"]
SEC_hi_v <- bev_xts_valid[, "SEC.hi"]
SEC_lo_v <- bev_xts_valid[, "SEC.lo"]

# Calculate the average of historical proportions
prop_hi <- mean(SEC_hi / SEC_total)
prop_lo <- mean(SEC_lo / SEC_total)

# Distribute the forecast
for_prop_hi <- prop_hi * for_SEC_t_xts
for_prop_lo <- prop_lo * for_SEC_t_xts

# Calculate a MAPE for each product
MAPE_hi <- 100 * mean(abs(for_prop_hi - SEC_hi_v) / SEC_hi_v)
print(MAPE_hi)
## [1] 7.988508
MAPE_lo <- 100 * mean(abs(for_prop_lo - SEC_lo_v) / SEC_lo_v)
print(MAPE_lo)
## [1] 5.202529
# Copy over for_M_t_xts data
for_M_t_xts <- xts(c(2207, 2021, 2010, 2052, 2075, 2074, 2065, 2058, 2056, 2055, 2053, 2052, 2050, 2049, 2048, 2047, 2046, 2045, 2044, 2043, 2043, 2042), order.by=dates_valid)

# Calculate the state sales forecast: for_state
for_state = for_SEC_t_xts + for_MET_t_xts + for_M_t_xts

# See the forecasts
for_state
##                [,1]
## 2017-01-01 9996.689
## 2017-01-08 9525.915
## 2017-01-15 9342.760
## 2017-01-22 9269.321
## 2017-01-29 9214.912
## 2017-02-05 9162.005
## 2017-02-12 9118.199
## 2017-02-19 9087.859
## 2017-02-26 9070.209
## 2017-03-05 9058.715
## 2017-03-12 9049.677
## 2017-03-19 9043.959
## 2017-03-26 9038.794
## 2017-04-02 9035.673
## 2017-04-09 9033.250
## 2017-04-16 9031.296
## 2017-04-23 9029.656
## 2017-04-30 9028.227
## 2017-05-07 9026.939
## 2017-05-14 9025.746
## 2017-05-21 9025.617
## 2017-05-28 9024.530

HR Analytics in R: Exploring Employee Data

Chapter 1 - Identifying the Best Recruiting Source

Introduction - Ben Teusch, HR Analytics Consultant:

  • HR analytics has many other names - people analytics, workforce analytics, etc.
  • Identify groups for comparison - high vs. low performers, groups with high vs. low turnover, etc.
    • Exploratory analysis and statistics for each group, including plots of key differences
  • The course is outlined as a series of case studies, with one case per chapter

Recruiting and quality of hire:

  • Where are the best hires coming from, and how can you get more of them
    • Defining quality of hire is challenging - some mix of productivity, satisfaction, retention, performance reviews, etc.
    • Attrition can be defined as the mean of a 0/1 indicator vector of "did the person leave in the time period T"

Visualizing recruiting data:

  • Helpful for communicating findings to decision makers
  • The geom_col() in ggplot will make a bar chart, with the y aesthetic being the bar height

Example code includes:

# Import the recruitment data
recruitment <- readr::read_csv("./RInputFiles/recruitment_data.csv")
## Parsed with column specification:
## cols(
##   attrition = col_integer(),
##   performance_rating = col_integer(),
##   sales_quota_pct = col_double(),
##   recruiting_source = col_character()
## )
# Look at the first few rows of the dataset
head(recruitment)
## # A tibble: 6 x 4
##   attrition performance_rating sales_quota_pct recruiting_source
##       <int>              <int>           <dbl> <chr>            
## 1         1                  3           1.09  Applied Online   
## 2         0                  3           2.39  <NA>             
## 3         1                  2           0.498 Campus           
## 4         0                  2           2.51  <NA>             
## 5         0                  3           1.42  Applied Online   
## 6         1                  3           0.548 Referral
# Get an overview of the recruitment data
summary(recruitment)
##    attrition     performance_rating sales_quota_pct   recruiting_source 
##  Min.   :0.000   Min.   :1.000      Min.   :-0.7108   Length:446        
##  1st Qu.:0.000   1st Qu.:2.000      1st Qu.: 0.5844   Class :character  
##  Median :0.000   Median :3.000      Median : 1.0701   Mode  :character  
##  Mean   :0.213   Mean   :2.895      Mean   : 1.0826                     
##  3rd Qu.:0.000   3rd Qu.:3.000      3rd Qu.: 1.5325                     
##  Max.   :1.000   Max.   :5.000      Max.   : 3.6667
# See which recruiting sources the company has been using
recruitment %>% 
  count(recruiting_source)
## # A tibble: 5 x 2
##   recruiting_source     n
##   <chr>             <int>
## 1 Applied Online      130
## 2 Campus               56
## 3 Referral             45
## 4 Search Firm          10
## 5 <NA>                205
# Find the average sales quota attainment for each recruiting source
avg_sales <- recruitment %>% 
  group_by(recruiting_source) %>% 
  summarize(avg_sales_quota_pct=mean(sales_quota_pct))

# Display the result
avg_sales
## # A tibble: 5 x 2
##   recruiting_source avg_sales_quota_pct
##   <chr>                           <dbl>
## 1 Applied Online                  1.06 
## 2 Campus                          0.908
## 3 Referral                        1.02 
## 4 Search Firm                     0.887
## 5 <NA>                            1.17
# Find the average attrition for the sales team, by recruiting source, sorted from lowest attrition rate to highest
avg_attrition <- recruitment %>%
  group_by(recruiting_source) %>% 
  summarize(attrition_rate=mean(attrition)) %>%
  arrange(attrition_rate)

# Display the result
avg_attrition
## # A tibble: 5 x 2
##   recruiting_source attrition_rate
##   <chr>                      <dbl>
## 1 <NA>                       0.132
## 2 Applied Online             0.246
## 3 Campus                     0.286
## 4 Referral                   0.333
## 5 Search Firm                0.500
# Plot the bar chart
avg_sales %>% ggplot(aes(x=recruiting_source, y=avg_sales_quota_pct)) + geom_col()

# Plot the bar chart
avg_attrition %>% ggplot(aes(x=recruiting_source, y=attrition_rate)) + geom_col()


Chapter 2 - What is driving low employee engagement

Analyzing employee engagement:

  • Gallup defines engaged employees as those who are involved in, enthusiastic about, and committed to their workplace
  • Survey data are available in the example case study
    • Will use both mutate() and ifelse()
    • The ifelse() is needed for vectors of length > 1 since it can work in a vectorized manner (and is thus OK inside the mutate call)

Visualizing the engagement data:

  • Multiple attributes in a single place can make for a more compelling report
  • The tidyr package is part of the tidyverse, and helps arrange the data properly for plotting
    • tidyr::gather(data, key = "key", value = "value", columns) is the call used in this example - it pulls the named columns down into key/value rows
    • ggplot(survey_gathered, aes(x = key, y = value, fill = department)) + geom_col(position = "dodge")
    • ggplot(survey_gathered, aes(x = key, y = value, fill = department)) + geom_col(position = "dodge") + facet_wrap(~ key, scales = "free")
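A runnable sketch of the reshape-then-plot pattern (toy summary table; the column names and values are invented):

```r
# Sketch: gather() pulls the measure columns down into key/value rows for ggplot
library(tidyr)
library(ggplot2)

survey_summary <- data.frame(
    department     = c("Sales", "Engineering"),
    avg_engagement = c(2.8, 3.2),
    avg_salary     = c(70000, 80000)
)

survey_gathered <- gather(survey_summary, key = "key", value = "value",
                          avg_engagement, avg_salary)  # 4 rows: 2 depts x 2 measures

ggplot(survey_gathered, aes(x = key, y = value, fill = department)) +
    geom_col(position = "dodge") +
    facet_wrap(~ key, scales = "free")
```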

Are differences meaningful?

  • Can use significance testing to assess likelhood (p-value) that the second sample could have come from the same population as the first sample
    • This course will use t-test (continuous variables) and chi-squared test (categorical variables)
    • t.test(tenure ~ is_manager, data = survey)
    • chisq.test(survey$left_company, survey$is_manager) # no data = argument is available in this function
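As a self-contained illustration of the t.test() formula interface (synthetic data, not the survey):

```r
# Two synthetic groups whose true means differ by 2
set.seed(42)
score <- c(rnorm(50, mean = 10), rnorm(50, mean = 12))
group <- rep(c("A", "B"), each = 50)
t.test(score ~ group)$p.value < 0.05   # TRUE - the difference is detected
```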

Example code includes:

# Import the data
survey <- readr::read_csv("./RInputFiles/survey_data.csv")
## Parsed with column specification:
## cols(
##   employee_id = col_integer(),
##   department = col_character(),
##   engagement = col_integer(),
##   salary = col_double(),
##   vacation_days_taken = col_integer()
## )
# Get an overview of the data
summary(survey)
##   employee_id      department          engagement       salary      
##  Min.   :   1.0   Length:1470        Min.   :1.00   Min.   : 45530  
##  1st Qu.: 491.2   Class :character   1st Qu.:3.00   1st Qu.: 59407  
##  Median :1020.5   Mode  :character   Median :3.00   Median : 70481  
##  Mean   :1024.9                      Mean   :3.05   Mean   : 74162  
##  3rd Qu.:1555.8                      3rd Qu.:4.00   3rd Qu.: 84763  
##  Max.   :2068.0                      Max.   :5.00   Max.   :164073  
##  vacation_days_taken
##  Min.   : 0.00      
##  1st Qu.: 6.00      
##  Median :10.00      
##  Mean   :11.27      
##  3rd Qu.:16.00      
##  Max.   :38.00
# Examine the counts of the department variable
survey %>% count(department)
## # A tibble: 3 x 2
##   department      n
##   <chr>       <int>
## 1 Engineering   961
## 2 Finance        63
## 3 Sales         446
# Output the average engagement score for each department, sorted
survey %>%
  group_by(department) %>%
  summarize(avg_engagement=mean(engagement)) %>%
  arrange(avg_engagement)
## # A tibble: 3 x 2
##   department  avg_engagement
##   <chr>                <dbl>
## 1 Sales                 2.81
## 2 Engineering           3.15
## 3 Finance               3.24
# Create the disengaged variable and assign the result to survey
survey_disengaged <- survey %>% 
  mutate(disengaged = ifelse(engagement <= 2, 1, 0)) 

survey_disengaged
## # A tibble: 1,470 x 6
##    employee_id department  engagement salary vacation_days_tak~ disengaged
##          <int> <chr>            <int>  <dbl>              <int>      <dbl>
##  1           1 Sales                3 103264                  7       0   
##  2           2 Engineering          3  80709                 12       0   
##  3           4 Engineering          3  60737                 12       0   
##  4           5 Engineering          3  99116                  7       0   
##  5           7 Engineering          3  51022                 18       0   
##  6           8 Engineering          3  98400                  9       0   
##  7          10 Engineering          3  57106                 18       0   
##  8          11 Engineering          1  55065                  4       1.00
##  9          12 Engineering          4  77158                 12       0   
## 10          13 Engineering          2  48365                 14       1.00
## # ... with 1,460 more rows
# Summarize the three variables by department
survey_summary <- survey_disengaged %>%
  group_by(department) %>%
  summarize(pct_disengaged=mean(disengaged), 
            avg_salary=mean(salary), 
            avg_vacation_taken=mean(vacation_days_taken)
            )

survey_summary
## # A tibble: 3 x 4
##   department  pct_disengaged avg_salary avg_vacation_taken
##   <chr>                <dbl>      <dbl>              <dbl>
## 1 Engineering          0.206      73576              12.2 
## 2 Finance              0.190      76652              11.5 
## 3 Sales                0.330      75074               9.22
# Gather data for plotting
survey_gathered <- survey_summary %>% 
  gather(key = "measure", value = "value",
         pct_disengaged, avg_salary, avg_vacation_taken)

# Create three bar charts
ggplot(survey_gathered, aes(x=measure, y=value, fill=department)) +
  geom_col(position="dodge") + 
  facet_wrap(~ measure, scales="free")

# Add the in_sales variable
survey_sales <- survey %>%
  mutate(in_sales = ifelse(department == "Sales", "Sales", "Other"), 
         disengaged = ifelse(engagement < 3, 1L, 0L)
         )

# Test the hypothesis using survey_sales
chisq.test(survey_sales$disengaged, survey_sales$in_sales)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  survey_sales$disengaged and survey_sales$in_sales
## X-squared = 25.524, df = 1, p-value = 4.368e-07
t.test(disengaged ~ in_sales, data=survey_sales)
## 
##  Welch Two Sample t-test
## 
## data:  disengaged by in_sales
## t = -4.862, df = 743.16, p-value = 1.419e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.17479596 -0.07424062
## sample estimates:
## mean in group Other mean in group Sales 
##           0.2050781           0.3295964
# Test the hypothesis using the survey_sales data
t.test(vacation_days_taken ~ in_sales, data = survey_sales)
## 
##  Welch Two Sample t-test
## 
## data:  vacation_days_taken by in_sales
## t = 8.1549, df = 1022.9, p-value = 1.016e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.229473 3.642409
## sample estimates:
## mean in group Other mean in group Sales 
##           12.160156            9.224215

Chapter 3 - Are new hires getting paid too much?

Paying new hires fairly:

  • Sometimes, current employees get paid less than new employees, which can drive low engagement and turnover
  • Case study will have a simulated pay dataset available for analysis
  • Can use broom::tidy() to return the outputs in a nicely formatted data frame
    • chisq.test(survey$in_sales, survey$disengaged) %>% tidy()

Omitted variable bias:

  • Key assumption of the tests is that the groups are the same, with the exception of the variables being tested
  • Omitted variable bias occurs when both 1) the omitted variable is correlated with the dependent variable, and 2) the omitted variable is correlated with an explanatory variable
    • Omitted variables are often known as confounders
    • Plotting can help to identify the issue, particularly with a stacked (to 100%) bar chart
    • pay %>% ggplot(aes(x = new_hire, fill = department)) + geom_bar(position = "fill")
    • The geom_bar() geom computes bar heights from the counts of x, in contrast to geom_col(), which takes bar heights from a y aesthetic

Linear regression helps to test the multivariate impacts of variables:

  • lm(salary ~ new_hire, data = pay) %>% tidy() # single explanatory variable
  • lm(salary ~ new_hire + department, data = pay) %>% tidy() # multiple explanatory variables
  • lm(salary ~ new_hire + department, data = pay) %>% summary() # more detailed summary of the linear regression

Example code includes:

# Import the data
pay <- readr::read_csv("./RInputFiles/fair_pay_data.csv")
## Parsed with column specification:
## cols(
##   employee_id = col_integer(),
##   department = col_character(),
##   salary = col_double(),
##   new_hire = col_character(),
##   job_level = col_character()
## )
# Get an overview of the data
summary(pay)
##   employee_id      department            salary         new_hire        
##  Min.   :   1.0   Length:1470        Min.   : 43820   Length:1470       
##  1st Qu.: 491.2   Class :character   1st Qu.: 59378   Class :character  
##  Median :1020.5   Mode  :character   Median : 70425   Mode  :character  
##  Mean   :1024.9                      Mean   : 74142                     
##  3rd Qu.:1555.8                      3rd Qu.: 84809                     
##  Max.   :2068.0                      Max.   :164073                     
##   job_level        
##  Length:1470       
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
# Check average salary of new hires and non-new hires
pay %>% 
  group_by(new_hire) %>%
  summarize(avg_salary=mean(salary))
## # A tibble: 2 x 2
##   new_hire avg_salary
##   <chr>         <dbl>
## 1 No            73425
## 2 Yes           76074
# Perform the correct statistical test
t.test(salary ~ new_hire, data = pay)
## 
##  Welch Two Sample t-test
## 
## data:  salary by new_hire
## t = -2.3437, df = 685.16, p-value = 0.01938
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4869.4242  -429.9199
## sample estimates:
##  mean in group No mean in group Yes 
##          73424.60          76074.28
t.test(salary ~ new_hire, data = pay) %>%
  broom::tidy()
##    estimate estimate1 estimate2 statistic    p.value parameter  conf.low
## 1 -2649.672   73424.6  76074.28 -2.343708 0.01937799  685.1554 -4869.424
##   conf.high                  method alternative
## 1 -429.9199 Welch Two Sample t-test   two.sided
# Create a stacked bar chart
pay %>%
  ggplot(aes(x=new_hire, fill=job_level)) + 
  geom_bar(position="fill")

# Calculate the average salary for each group of interest
pay_grouped <- pay %>% 
  group_by(new_hire, job_level) %>% 
  summarize(avg_salary = mean(salary))
  
# Graph the results using facet_wrap()  
pay_grouped %>%
  ggplot(aes(x=new_hire, y=avg_salary)) + 
  geom_col() + 
  facet_wrap(~ job_level)

# Filter the data to include only hourly employees
pay_filter <- pay %>%
  filter(job_level == "Hourly")

# Test the difference in pay
t.test(salary ~ new_hire, data=pay_filter) %>%
  broom::tidy()
##    estimate estimate1 estimate2 statistic    p.value parameter  conf.low
## 1 -1106.967  63965.71  65072.68 -1.750387 0.08066517  499.7005 -2349.483
##   conf.high                  method alternative
## 1  135.5483 Welch Two Sample t-test   two.sided
# Run the simple regression
model_simple <- lm(salary ~ new_hire, data = pay)

# Display the summary of model_simple
model_simple %>% 
  summary()
## 
## Call:
## lm(formula = salary ~ new_hire, data = pay)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -32255 -14466  -3681  10740  87998 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  73424.6      577.2 127.200   <2e-16 ***
## new_hireYes   2649.7     1109.4   2.388    0.017 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18900 on 1468 degrees of freedom
## Multiple R-squared:  0.003871,   Adjusted R-squared:  0.003193 
## F-statistic: 5.705 on 1 and 1468 DF,  p-value: 0.01704
# Display a tidy summary
model_simple %>% 
  broom::tidy()
##          term  estimate std.error  statistic    p.value
## 1 (Intercept) 73424.603  577.2369 127.200112 0.00000000
## 2 new_hireYes  2649.672 1109.3568   2.388476 0.01704414
# Run the multiple regression
model_multiple <- lm(salary ~ new_hire + job_level, data = pay)

# Display the summary of model_multiple
model_multiple %>% 
  summary()
## 
## Call:
## lm(formula = salary ~ new_hire + job_level, data = pay)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -21013  -7091   -425   6771  44322 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        64049.3      308.3 207.722   <2e-16 ***
## new_hireYes          782.7      524.8   1.491    0.136    
## job_levelManager   54918.8      915.3  60.001   <2e-16 ***
## job_levelSalaried  26865.6      567.2  47.369   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8930 on 1466 degrees of freedom
## Multiple R-squared:  0.7779, Adjusted R-squared:  0.7775 
## F-statistic:  1712 on 3 and 1466 DF,  p-value: < 2.2e-16
# Display a tidy summary
model_multiple %>% 
  broom::tidy()
##                term   estimate std.error  statistic       p.value
## 1       (Intercept) 64049.3417  308.3415 207.722067  0.000000e+00
## 2       new_hireYes   782.7358  524.8133   1.491456  1.360570e-01
## 3  job_levelManager 54918.8477  915.2966  60.001152  0.000000e+00
## 4 job_levelSalaried 26865.6336  567.1569  47.368962 7.387196e-298

Chapter 4 - Are performance ratings being given consistently?

Joining HR data:

  • Employee data tend to be stored in different locations, requiring joins (merges) prior to running analyses
    • dplyr::left_join(hr_data, bonus_pay_data, by = "employee_id")
    • All employees in hr_data will be kept, even if there is no matching record in bonus_pay_data
    • Employee ID (or similar) is by far the best way to join data - names tend to be non-unique and can differ in different systems
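A toy sketch of the left-join behavior described above (hypothetical two-row bonus table; assumes the tidyverse is loaded):

```r
library(dplyr)

hr_data <- tibble::tibble(employee_id = 1:3,
                          department = c("Sales", "Finance", "Sales"))
bonus_pay_data <- tibble::tibble(employee_id = c(1L, 3L),
                                 bonus = c(500, 250))

# Employee 2 has no bonus record but is still kept, with bonus = NA
left_join(hr_data, bonus_pay_data, by = "employee_id")
```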

Performance ratings and fairness:

  • Performance ratings are inherently subjective and thus prone to bias
  • Unconscious bias is based on the brain’s heuristics, and may include preferences for members of various groups (biases, as reflected in hiring, promotion, etc.)

Logistic regression is especially helpful for modeling binary response variables:

  • glm(high_performer ~ salary, data = hr, family = "binomial") %>% tidy()
  • glm(high_performer ~ salary + department, data = hr, family = "binomial") %>% tidy()

Example code includes:

# Import the data
hr_data <- readr::read_csv("./RInputFiles/hr_data.csv")
## Parsed with column specification:
## cols(
##   employee_id = col_integer(),
##   department = col_character(),
##   job_level = col_character(),
##   gender = col_character()
## )
performance_data <- readr::read_csv("./RInputFiles/performance_data.csv")
## Parsed with column specification:
## cols(
##   employee_id = col_integer(),
##   rating = col_integer()
## )
# Examine the datasets
summary(hr_data)
##   employee_id      department         job_level            gender         
##  Min.   :   1.0   Length:1470        Length:1470        Length:1470       
##  1st Qu.: 491.2   Class :character   Class :character   Class :character  
##  Median :1020.5   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1024.9                                                           
##  3rd Qu.:1555.8                                                           
##  Max.   :2068.0
summary(performance_data)
##   employee_id         rating    
##  Min.   :   1.0   Min.   :1.00  
##  1st Qu.: 491.2   1st Qu.:2.00  
##  Median :1020.5   Median :3.00  
##  Mean   :1024.9   Mean   :2.83  
##  3rd Qu.:1555.8   3rd Qu.:4.00  
##  Max.   :2068.0   Max.   :5.00
# Join the two tables
joined_data <- left_join(hr_data, performance_data, by = "employee_id")

# Examine the result
summary(joined_data)
##   employee_id      department         job_level            gender         
##  Min.   :   1.0   Length:1470        Length:1470        Length:1470       
##  1st Qu.: 491.2   Class :character   Class :character   Class :character  
##  Median :1020.5   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1024.9                                                           
##  3rd Qu.:1555.8                                                           
##  Max.   :2068.0                                                           
##      rating    
##  Min.   :1.00  
##  1st Qu.:2.00  
##  Median :3.00  
##  Mean   :2.83  
##  3rd Qu.:4.00  
##  Max.   :5.00
# Check whether the average performance rating differs by gender 
joined_data %>%
  group_by(gender) %>%
  summarize(avg_rating = mean(rating))
## # A tibble: 2 x 2
##   gender avg_rating
##   <chr>       <dbl>
## 1 Female       2.75
## 2 Male         2.92
# Add the high_performer column
performance <- joined_data %>%  
  mutate(high_performer = ifelse(rating >= 4, 1, 0))

# Test whether one gender is more likely to be a high performer
chisq.test(performance$gender, performance$high_performer)   
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  performance$gender and performance$high_performer
## X-squared = 22.229, df = 1, p-value = 2.42e-06
# Do the same test, and tidy the output
chisq.test(performance$gender, performance$high_performer) %>% broom::tidy()
##   statistic      p.value parameter
## 1  22.22915 2.419716e-06         1
##                                                         method
## 1 Pearson's Chi-squared test with Yates' continuity correction
# Visualize the distribution of high_performer by gender
performance %>%
  ggplot(aes(x=gender, fill=factor(high_performer))) + 
  geom_bar(position="fill")

# Visualize the distribution of all ratings by gender
performance %>%
  ggplot(aes(x=gender, fill=factor(rating))) + 
  geom_bar(position="fill")

# Visualize the distribution of job_level by gender
performance %>%
  ggplot(aes(x = gender, fill = job_level)) +
  geom_bar(position = "fill")

# Test whether men and women have different job level distributions
chisq.test(performance$gender, performance$job_level) 
## 
##  Pearson's Chi-squared test
## 
## data:  performance$gender and performance$job_level
## X-squared = 44.506, df = 2, p-value = 2.166e-10
# Visualize the distribution of high_performer by gender, faceted by job level
performance %>%
  ggplot(aes(x = gender, fill = factor(high_performer))) +
  geom_bar(position = "fill") + 
  facet_wrap(~ job_level)

# Run a simple logistic regression
logistic_simple <- glm(high_performer ~ gender, family = "binomial", data = performance) 

# View the result with summary()
logistic_simple %>%
  summary()
## 
## Call:
## glm(formula = high_performer ~ gender, family = "binomial", data = performance)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8871  -0.8871  -0.6957   1.4986   1.7535  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.29540    0.08813 -14.699  < 2e-16 ***
## genderMale   0.56596    0.11921   4.748 2.06e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1709.0  on 1469  degrees of freedom
## Residual deviance: 1686.1  on 1468  degrees of freedom
## AIC: 1690.1
## 
## Number of Fisher Scoring iterations: 4
# View a tidy version of the result
logistic_simple %>%
  broom::tidy()
##          term  estimate std.error  statistic      p.value
## 1 (Intercept) -1.295395 0.0881306 -14.698586 6.581976e-49
## 2  genderMale  0.565958 0.1192110   4.747532 2.059140e-06
# Run a multiple logistic regression
logistic_multiple <- glm(high_performer ~ gender + job_level, family = "binomial", data = performance)

# View the result with summary() or tidy()
logistic_multiple %>% broom::tidy()
##                term   estimate std.error  statistic      p.value
## 1       (Intercept) -1.6945360 0.1025902 -16.517522 2.744425e-61
## 2        genderMale  0.3191103 0.1290197   2.473345 1.338548e-02
## 3  job_levelManager  2.7436430 0.2514410  10.911677 1.013683e-27
## 4 job_levelSalaried  1.0992751 0.1405230   7.822742 5.168508e-15

Chapter 5 - Improving employee safety with data

Employee safety - looking at accident rates and drivers:

  • Requires joining data on multiple variables
    • joined_data <- left_join(hr_data, safety_data, by = c("year", "employee_id"))
    • joined_data %>% filter(is.na(accident_time)) # use is.na() rather than == NA to find unmatched rows

Focusing on the location of interest:

  • May want to run comparisons of the same location over time
  • May want to assess differences by locations to see if they may be explanatory variables

Explaining the increase in accidents:

  • Can use multiple regression to help test for explanatory variables that impact the accident rate

Wrap up:

  • Key tools from the Tidyverse (ggplot2, broom, dplyr, etc.) to assess HR data
  • Analytics usage within HR, including differences between HR data and other data
  • Can apply additional data science techniques on HR data

Example code includes:

# Import the data 
hr_data <- readr::read_csv("./RInputFiles/hr_data_2.csv")
## Parsed with column specification:
## cols(
##   year = col_integer(),
##   employee_id = col_integer(),
##   location = col_character(),
##   overtime_hours = col_integer()
## )
accident_data <- readr::read_csv("./RInputFiles/accident_data.csv")
## Parsed with column specification:
## cols(
##   year = col_integer(),
##   employee_id = col_integer(),
##   accident_type = col_character()
## )
# Create hr_joined with left_join() and mutate()
hr_joined <- left_join(hr_data, accident_data, by=c("year", "employee_id")) %>% 
  mutate(had_accident=ifelse(is.na(accident_type), 0, 1))
  
hr_joined
## # A tibble: 2,940 x 6
##     year employee_id location    overtime_hours accident_type had_accident
##    <int>       <int> <chr>                <int> <chr>                <dbl>
##  1  2016           1 Northwood               14 <NA>                  0   
##  2  2017           1 Northwood                8 Mild                  1.00
##  3  2016           2 East Valley              8 <NA>                  0   
##  4  2017           2 East Valley             11 <NA>                  0   
##  5  2016           4 East Valley              4 <NA>                  0   
##  6  2017           4 East Valley              2 Mild                  1.00
##  7  2016           5 West River               0 <NA>                  0   
##  8  2017           5 West River              17 <NA>                  0   
##  9  2016           7 West River              21 <NA>                  0   
## 10  2017           7 West River               9 <NA>                  0   
## # ... with 2,930 more rows
# Find accident rate for each year
hr_joined %>% 
  group_by(year) %>% 
  summarize(accident_rate = mean(had_accident))
## # A tibble: 2 x 2
##    year accident_rate
##   <int>         <dbl>
## 1  2016        0.0850
## 2  2017        0.120
# Test difference in accident rate between years
chisq.test(hr_joined$year, hr_joined$had_accident)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  hr_joined$year and hr_joined$had_accident
## X-squared = 9.5986, df = 1, p-value = 0.001947
# Which location had the highest accident rate?
hr_joined %>%
  group_by(location) %>%
  summarize(accident_rate=mean(had_accident)) %>%
  arrange(-accident_rate)
## # A tibble: 4 x 2
##   location    accident_rate
##   <chr>               <dbl>
## 1 East Valley        0.128 
## 2 Southfield         0.103 
## 3 West River         0.0961
## 4 Northwood          0.0831
# Compare annual accident rates by location
accident_rates <- hr_joined %>% 
  group_by(location, year) %>% 
  summarize(accident_rate = mean(had_accident))
  
accident_rates
## # A tibble: 8 x 3
## # Groups:   location [?]
##   location     year accident_rate
##   <chr>       <int>         <dbl>
## 1 East Valley  2016        0.113 
## 2 East Valley  2017        0.143 
## 3 Northwood    2016        0.0644
## 4 Northwood    2017        0.102 
## 5 Southfield   2016        0.0764
## 6 Southfield   2017        0.130 
## 7 West River   2016        0.0824
## 8 West River   2017        0.110
# Graph it
accident_rates %>% 
  ggplot(aes(factor(year), accident_rate)) +
  geom_col() +
  facet_wrap(~location)

# Filter out the other locations
southfield <- hr_joined %>% 
  filter(location == "Southfield")

# Find the average overtime hours worked by year
southfield %>%
  group_by(year) %>% 
  summarize(average_overtime_hours = mean(overtime_hours))
## # A tibble: 2 x 2
##    year average_overtime_hours
##   <int>                  <dbl>
## 1  2016                   8.67
## 2  2017                   9.97
# Test difference in Southfield's overtime hours between years
t.test(overtime_hours ~ year, data=southfield) 
## 
##  Welch Two Sample t-test
## 
## data:  overtime_hours by year
## t = -1.6043, df = 595.46, p-value = 0.1092
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.904043  0.292747
## sample estimates:
## mean in group 2016 mean in group 2017 
##           8.667774           9.973422
# Import the survey data
survey_data <- readr::read_csv("./RInputFiles/survey_data_2.csv")
## Parsed with column specification:
## cols(
##   year = col_integer(),
##   employee_id = col_integer(),
##   engagement = col_integer()
## )
# Create the safety dataset
safety <- left_join(hr_joined, survey_data, by=c("employee_id", "year")) %>%
  mutate(disengaged=ifelse(engagement <= 2, 1, 0), year=factor(year))


# Visualize the difference in % disengaged by year in Southfield
safety %>% 
    filter(location=="Southfield") %>%
    ggplot(aes(x = year, fill = factor(disengaged))) +
    geom_bar(position = "fill")

# Test whether one year had significantly more disengaged employees
southSafety <- safety %>% 
    filter(location=="Southfield")
chisq.test(southSafety$disengaged, southSafety$year)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  southSafety$disengaged and southSafety$year
## X-squared = 7.1906, df = 1, p-value = 0.007329
# Filter out Southfield
other_locs <- safety %>% 
  filter(location != "Southfield")

# Test whether one year had significantly more overtime hours worked
t.test(overtime_hours ~ year, data = other_locs) 
## 
##  Welch Two Sample t-test
## 
## data:  overtime_hours by year
## t = -0.48267, df = 2320.3, p-value = 0.6294
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.9961022  0.6026035
## sample estimates:
## mean in group 2016 mean in group 2017 
##           9.278015           9.474765
# Test whether one year had significantly more disengaged employees
chisq.test(other_locs$year, other_locs$disengaged)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  other_locs$year and other_locs$disengaged
## X-squared = 0.0023091, df = 1, p-value = 0.9617
# Use multiple regression to test the impact of year and disengaged on accident rate in Southfield
regression <- glm(had_accident ~ year + disengaged, family = "binomial", data = southSafety)

# Examine the output
regression %>% broom::tidy()
##          term   estimate std.error  statistic      p.value
## 1 (Intercept) -2.9237983 0.2504685 -11.673318 1.744826e-31
## 2    year2017  0.4402062 0.2848346   1.545480 1.222302e-01
## 3  disengaged  1.4425408 0.2780927   5.187265 2.134047e-07

Supervised Learning in R: Case Studies

Chapter 1 - Cars Data

Making predictions using machine learning:

  • Course focuses on applied predictive modeling skills, using regression and classification as well as EDA
    • Regression tends to be for predicting continuous, numeric variables
    • Classification tends to be for predicting categorical variables
  • Case studies include 1) fuel efficiency, 2) Stack Overflow developer survey, 3) voter turnout, and 4) ages of nuns
  • The fuel efficiency data is stored in cars2018 and is based on data from the US Department of Energy
    • Variable names with spaces can be handled by surrounding them with backticks
    • The tidyverse includes tibble, readr, ggplot2, dplyr, tidyr, purrr, etc. - all can be loaded at once with library(tidyverse)
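A quick illustration of the backtick rule with a tiny hand-made data frame (not cars2018):

```r
# check.names = FALSE keeps the space in the column name
d <- data.frame(`Max Ethanol` = c(10, 15, 85), check.names = FALSE)
mean(d$`Max Ethanol`)
## [1] 36.66667
```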

Getting started with caret:

  • The caret package is useful for predictive modeling - it covers the full process, including the train/test split of the raw dataset
    • in_train <- createDataPartition(cars_vars$Aspiration, p = 0.8, list = FALSE) # will stratify on the Aspiration variable
    • training <- cars_vars[in_train, ]
    • testing <- cars_vars[-in_train, ]
  • Can then train the model using only the training dataset
    • fit_lm <- train(log(MPG) ~ ., method = "lm", data = training, trControl = trainControl(method = "none"))
    • Can then use the yardstick package to assess the quality of the model
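As a reminder of what a metric such as RMSE (one of the metrics yardstick reports) actually computes, a base-R sketch on made-up predictions:

```r
# Root mean squared error by hand
predicted <- c(3.0, 3.2, 2.7, 3.1)
actual    <- c(3.1, 3.0, 2.5, 3.3)
sqrt(mean((predicted - actual)^2))
## [1] 0.1802776
```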

Sampling data:

  • Bootstrap resampling means sampling with replacement, and then fitting the model on each resampled dataset (repeated multiple times)
    • cars_rf_bt <- train(log(MPG) ~ ., method = "rf", data = training, trControl = trainControl(method = "boot")) # defaults to 25 resamples
    • Can both visualize the models and assess the model statistically

Example code includes:

cars2018 <- readr::read_csv("./RInputFiles/cars2018.csv")
## Parsed with column specification:
## cols(
##   Model = col_character(),
##   `Model Index` = col_integer(),
##   Displacement = col_double(),
##   Cylinders = col_integer(),
##   Gears = col_integer(),
##   Transmission = col_character(),
##   MPG = col_integer(),
##   Aspiration = col_character(),
##   `Lockup Torque Converter` = col_character(),
##   Drive = col_character(),
##   `Max Ethanol` = col_integer(),
##   `Recommended Fuel` = col_character(),
##   `Intake Valves Per Cyl` = col_integer(),
##   `Exhaust Valves Per Cyl` = col_integer(),
##   `Fuel injection` = col_character()
## )
str(cars2018, give.attr = FALSE)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1144 obs. of  15 variables:
##  $ Model                  : chr  "Acura NSX" "ALFA ROMEO 4C" "Audi R8 AWD" "Audi R8 RWD" ...
##  $ Model Index            : int  57 410 65 71 66 72 46 488 38 278 ...
##  $ Displacement           : num  3.5 1.8 5.2 5.2 5.2 5.2 2 3 8 6.2 ...
##  $ Cylinders              : int  6 4 10 10 10 10 4 6 16 8 ...
##  $ Gears                  : int  9 6 7 7 7 7 6 7 7 8 ...
##  $ Transmission           : chr  "Manual" "Manual" "Manual" "Manual" ...
##  $ MPG                    : int  21 28 17 18 17 18 26 20 11 18 ...
##  $ Aspiration             : chr  "Turbocharged/Supercharged" "Turbocharged/Supercharged" "Naturally Aspirated" "Naturally Aspirated" ...
##  $ Lockup Torque Converter: chr  "Y" "Y" "Y" "Y" ...
##  $ Drive                  : chr  "All Wheel Drive" "2-Wheel Drive, Rear" "All Wheel Drive" "2-Wheel Drive, Rear" ...
##  $ Max Ethanol            : int  10 10 15 15 15 15 15 10 15 10 ...
##  $ Recommended Fuel       : chr  "Premium Unleaded Required" "Premium Unleaded Required" "Premium Unleaded Recommended" "Premium Unleaded Recommended" ...
##  $ Intake Valves Per Cyl  : int  2 2 2 2 2 2 2 2 2 1 ...
##  $ Exhaust Valves Per Cyl : int  2 2 2 2 2 2 2 2 2 1 ...
##  $ Fuel injection         : chr  "Direct ignition" "Direct ignition" "Direct ignition" "Direct ignition" ...
summary(cars2018)
##     Model            Model Index     Displacement     Cylinders     
##  Length:1144        Min.   :  1.0   Min.   :1.000   Min.   : 3.000  
##  Class :character   1st Qu.: 36.0   1st Qu.:2.000   1st Qu.: 4.000  
##  Mode  :character   Median :108.0   Median :3.000   Median : 6.000  
##                     Mean   :201.3   Mean   :3.087   Mean   : 5.564  
##                     3rd Qu.:323.8   3rd Qu.:3.600   3rd Qu.: 6.000  
##                     Max.   :821.0   Max.   :8.000   Max.   :16.000  
##      Gears        Transmission            MPG        Aspiration       
##  Min.   : 1.000   Length:1144        Min.   :11.0   Length:1144       
##  1st Qu.: 6.000   Class :character   1st Qu.:19.0   Class :character  
##  Median : 7.000   Mode  :character   Median :23.0   Mode  :character  
##  Mean   : 6.935                      Mean   :23.2                     
##  3rd Qu.: 8.000                      3rd Qu.:26.0                     
##  Max.   :10.000                      Max.   :58.0                     
##  Lockup Torque Converter    Drive            Max Ethanol   
##  Length:1144             Length:1144        Min.   :10.00  
##  Class :character        Class :character   1st Qu.:10.00  
##  Mode  :character        Mode  :character   Median :10.00  
##                                             Mean   :15.29  
##                                             3rd Qu.:15.00  
##                                             Max.   :85.00  
##  Recommended Fuel   Intake Valves Per Cyl Exhaust Valves Per Cyl
##  Length:1144        Min.   :1.000         Min.   :1.000         
##  Class :character   1st Qu.:2.000         1st Qu.:2.000         
##  Mode  :character   Median :2.000         Median :2.000         
##                     Mean   :1.926         Mean   :1.922         
##                     3rd Qu.:2.000         3rd Qu.:2.000         
##                     Max.   :2.000         Max.   :2.000         
##  Fuel injection    
##  Length:1144       
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
# Print the cars2018 object
cars2018
## # A tibble: 1,144 x 15
##    Model     `Model Index` Displacement Cylinders Gears Transmission   MPG
##    <chr>             <int>        <dbl>     <int> <int> <chr>        <int>
##  1 Acura NSX            57         3.50         6     9 Manual          21
##  2 ALFA ROM~           410         1.80         4     6 Manual          28
##  3 Audi R8 ~            65         5.20        10     7 Manual          17
##  4 Audi R8 ~            71         5.20        10     7 Manual          18
##  5 Audi R8 ~            66         5.20        10     7 Manual          17
##  6 Audi R8 ~            72         5.20        10     7 Manual          18
##  7 Audi TT ~            46         2.00         4     6 Manual          26
##  8 BMW M4 D~           488         3.00         6     7 Manual          20
##  9 Bugatti ~            38         8.00        16     7 Manual          11
## 10 Chevrole~           278         6.20         8     8 Automatic       18
## # ... with 1,134 more rows, and 8 more variables: Aspiration <chr>,
## #   `Lockup Torque Converter` <chr>, Drive <chr>, `Max Ethanol` <int>,
## #   `Recommended Fuel` <chr>, `Intake Valves Per Cyl` <int>, `Exhaust
## #   Valves Per Cyl` <int>, `Fuel injection` <chr>
# Plot the histogram
ggplot(cars2018, aes(x = MPG)) +
    geom_histogram(bins = 25) +
    labs(y = "Number of cars",
         x = "Fuel efficiency (mpg)")

# Deselect the 2 columns to create cars_vars
cars_vars <- cars2018 %>%
    select(-Model, -`Model Index`)

# Fit a linear model
fit_all <- lm(MPG ~ ., data = cars_vars)

# Print the summary of the model
summary(fit_all)
## 
## Call:
## lm(formula = MPG ~ ., data = cars_vars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.5261 -1.6473 -0.1096  1.3572 26.5045 
## 
## Coefficients:
##                                                 Estimate Std. Error
## (Intercept)                                    44.539519   1.176283
## Displacement                                   -3.786147   0.264845
## Cylinders                                       0.520284   0.161802
## Gears                                           0.157674   0.069984
## TransmissionCVT                                 4.877637   0.404051
## TransmissionManual                             -1.074608   0.366075
## AspirationTurbocharged/Supercharged            -2.190248   0.267559
## `Lockup Torque Converter`Y                     -2.624494   0.381252
## Drive2-Wheel Drive, Rear                       -2.676716   0.291044
## Drive4-Wheel Drive                             -3.397532   0.335147
## DriveAll Wheel Drive                           -2.941084   0.257174
## `Max Ethanol`                                  -0.007377   0.005898
## `Recommended Fuel`Premium Unleaded Required    -0.403935   0.262413
## `Recommended Fuel`Regular Unleaded Recommended -0.996343   0.272495
## `Intake Valves Per Cyl`                        -1.446107   1.620575
## `Exhaust Valves Per Cyl`                       -2.469747   1.547748
## `Fuel injection`Multipoint/sequential ignition -0.658428   0.243819
##                                                t value Pr(>|t|)    
## (Intercept)                                     37.865  < 2e-16 ***
## Displacement                                   -14.296  < 2e-16 ***
## Cylinders                                        3.216 0.001339 ** 
## Gears                                            2.253 0.024450 *  
## TransmissionCVT                                 12.072  < 2e-16 ***
## TransmissionManual                              -2.935 0.003398 ** 
## AspirationTurbocharged/Supercharged             -8.186 7.24e-16 ***
## `Lockup Torque Converter`Y                      -6.884 9.65e-12 ***
## Drive2-Wheel Drive, Rear                        -9.197  < 2e-16 ***
## Drive4-Wheel Drive                             -10.137  < 2e-16 ***
## DriveAll Wheel Drive                           -11.436  < 2e-16 ***
## `Max Ethanol`                                   -1.251 0.211265    
## `Recommended Fuel`Premium Unleaded Required     -1.539 0.124010    
## `Recommended Fuel`Regular Unleaded Recommended  -3.656 0.000268 ***
## `Intake Valves Per Cyl`                         -0.892 0.372400    
## `Exhaust Valves Per Cyl`                        -1.596 0.110835    
## `Fuel injection`Multipoint/sequential ignition  -2.700 0.007028 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.916 on 1127 degrees of freedom
## Multiple R-squared:  0.7314, Adjusted R-squared:  0.7276 
## F-statistic: 191.8 on 16 and 1127 DF,  p-value: < 2.2e-16
# Load caret
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
# Split the data into training and test sets
set.seed(1234)
in_train <- createDataPartition(cars_vars$Transmission, p = 0.8, list = FALSE)
training <- cars_vars[in_train, ]
testing <- cars_vars[-in_train, ]

# Train a linear regression model
fit_lm <- train(log(MPG) ~ ., method = "lm", data = training,
                trControl = trainControl(method = "none"))

# Print the model object
fit_lm
## Linear Regression 
## 
## 916 samples
##  12 predictor
## 
## No pre-processing
## Resampling: None
# Train a random forest model
fit_rf <- train(log(MPG) ~ ., method = "rf", data = training,
                trControl = trainControl(method = "none"))

# Print the model object
fit_rf
## Random Forest 
## 
## 916 samples
##  12 predictor
## 
## No pre-processing
## Resampling: None
# Create the new columns (note: both models were fit to log(MPG), so these predictions are on the log scale)
results <- training %>%
    mutate(`Linear regression` = predict(fit_lm, training),
           `Random forest` = predict(fit_rf, training))

# Evaluate the performance
yardstick::metrics(results, truth = MPG, estimate = `Linear regression`)
## # A tibble: 1 x 2
##    rmse   rsq
##   <dbl> <dbl>
## 1  20.9 0.702
yardstick::metrics(results, truth = MPG, estimate = `Random forest`)
## # A tibble: 1 x 2
##    rmse   rsq
##   <dbl> <dbl>
## 1  20.9 0.845
# Create the new columns
results <- testing %>%
    mutate(`Linear regression` = predict(fit_lm, testing),
           `Random forest` = predict(fit_rf, testing))

# Evaluate the performance
yardstick::metrics(results, truth = MPG, estimate = `Linear regression`)
## # A tibble: 1 x 2
##    rmse   rsq
##   <dbl> <dbl>
## 1  20.5 0.799
yardstick::metrics(results, truth = MPG, estimate = `Random forest`)
## # A tibble: 1 x 2
##    rmse   rsq
##   <dbl> <dbl>
## 1  20.5 0.880
# Fit the models with bootstrap resampling
cars_lm_bt <- train(log(MPG) ~ ., method = "lm", data = training,
                   trControl = trainControl(method = "boot"))
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading
cars_rf_bt <- train(log(MPG) ~ ., method = "rf", data = training,
                   trControl = trainControl(method = "boot"))
                   
# Quick look at the models
cars_lm_bt
## Linear Regression 
## 
## 916 samples
##  12 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 916, 916, 916, 916, 916, 916, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE       
##   0.1036278  0.7890514  0.07656104
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
cars_rf_bt
## Random Forest 
## 
## 916 samples
##  12 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 916, 916, 916, 916, 916, 916, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE        Rsquared   MAE       
##    2    0.10015480  0.8205322  0.07299305
##    9    0.08758544  0.8466598  0.06129895
##   16    0.09100659  0.8360034  0.06313542
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 9.
results <- testing %>%
    mutate(`Linear regression` = predict(cars_lm_bt, testing),
           `Random forest` = predict(cars_rf_bt, testing))

yardstick::metrics(results, truth = MPG, estimate = `Linear regression`)
## # A tibble: 1 x 2
##    rmse   rsq
##   <dbl> <dbl>
## 1  20.5 0.799
yardstick::metrics(results, truth = MPG, estimate = `Random forest`)
## # A tibble: 1 x 2
##    rmse   rsq
##   <dbl> <dbl>
## 1  20.5 0.903
results %>%
    gather(Method, Result, `Linear regression`:`Random forest`) %>%
    ggplot(aes(log(MPG), Result, color = Method)) +
    geom_point(size = 1.5, alpha = 0.5) +
    facet_wrap(~Method) +
    geom_abline(lty = 2, color = "gray50") +
    geom_smooth(method = "lm")


Chapter 2 - Stack Overflow Developer Data

Essential copying and pasting from Stack Overflow (the largest, most trusted online community for developers):

  • Annual survey of developer perspectives on Stack Overflow - can be used for predictive modeling
  • Data is made available publicly at insights.stackoverflow.com/survey
  • Key question is “what makes a developer more likely to work remotely” (size of company, geography of employee, etc.)
    • Data are class imbalanced, with many more Non-Remote employees than Remote employees
    • Best first step is the simplest model - logit, without any tricks
    • simple_glm <- stackoverflow %>% select(-Respondent) %>% glm(Remote ~ ., family = "binomial", data = .) # Remote ~ . means "all variables" while data = . means "use the piped dataset"

Dealing with imbalanced data:

  • Class imbalance is a common problem that can negatively impact model performance
    • This dataset has roughly 10x as many non-remote as remote developers, which can push a model toward predicting non-remote in every case
  • One approach to class imbalance is upsampling: resampling the minority class with replacement until it matches the size of the majority class
    • Simple to implement, but with the risk of over-fitting
    • up_train <- upSample(x = select(training, -Remote), y = training$Remote, yname = "Remote") %>% as_tibble()
    • stack_glm <- train(Remote ~ ., method = "glm", family = "binomial", data = training, trControl = trainControl(method = "boot", sampling = "up"))
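The mechanics of upsampling can be sketched in a few lines of base R (synthetic labels, not the survey data; `caret::upSample()` does this for you):

```r
# Minimal sketch of upsampling a minority class with replacement.
set.seed(123)
y <- factor(c(rep("Not remote", 90), rep("Remote", 10)))  # imbalanced labels

minority <- which(y == "Remote")
# Draw enough extra minority rows (with replacement) to match the majority class
extra <- sample(minority, size = 90 - 10, replace = TRUE)

y_up <- y[c(seq_along(y), extra)]
table(y_up)  # both classes now have 90 observations
```

In practice the same resampled row indices would be applied to the predictor columns as well, which is exactly what `caret::upSample(x, y)` handles.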

Predicting remote status:

  • Classification models can include logistic regression and random forests
    • stack_glm <- train(Remote ~ ., method = "glm", family = "binomial", data = training, trControl = trainControl(method = "boot", sampling = "up"))
    • stack_rf <- train(Remote ~ ., method = "rf", data = training, trControl = trainControl(method = "boot", sampling = "up"))
  • Classification models can be evaluated using the confusion matrix
    • confusionMatrix(predict(stack_glm, testing), testing$Remote)
    • yardstick::accuracy(testing_results, truth = Remote, estimate = `Logistic regression`)
    • yardstick::ppv(testing_results, truth = Remote, estimate = `Logistic regression`)
    • yardstick::npv(testing_results, truth = Remote, estimate = `Logistic regression`)

Example code includes:

stackoverflow <- readr::read_csv("./RInputFiles/stackoverflow.csv")
## Parsed with column specification:
## cols(
##   .default = col_logical(),
##   Respondent = col_integer(),
##   Country = col_character(),
##   Salary = col_double(),
##   YearsCodedJob = col_integer(),
##   CompanySizeNumber = col_double(),
##   Remote = col_character(),
##   CareerSatisfaction = col_integer()
## )
## See spec(...) for full column specifications.
stackoverflow$Remote <- factor(stackoverflow$Remote, levels=c("Not remote", "Remote"))
str(stackoverflow, give.attr = FALSE)
## Classes 'tbl_df', 'tbl' and 'data.frame':    6991 obs. of  22 variables:
##  $ Respondent                          : int  3 15 18 19 26 55 62 71 73 77 ...
##  $ Country                             : chr  "United Kingdom" "United Kingdom" "United States" "United States" ...
##  $ Salary                              : num  113750 100000 130000 82500 175000 ...
##  $ YearsCodedJob                       : int  20 20 20 3 16 4 1 1 20 20 ...
##  $ OpenSource                          : logi  TRUE FALSE TRUE FALSE FALSE FALSE ...
##  $ Hobby                               : logi  TRUE TRUE TRUE TRUE TRUE FALSE ...
##  $ CompanySizeNumber                   : num  10000 5000 1000 10000 10000 1000 5000 20 100 1000 ...
##  $ Remote                              : Factor w/ 2 levels "Not remote","Remote": 1 2 2 1 1 1 1 1 2 2 ...
##  $ CareerSatisfaction                  : int  8 8 9 5 7 9 5 8 8 10 ...
##  $ Data scientist                      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Database administrator              : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Desktop applications developer      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Developer with stats/math background: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ DevOps                              : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
##  $ Embedded developer                  : logi  FALSE TRUE TRUE FALSE FALSE FALSE ...
##  $ Graphic designer                    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Graphics programming                : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Machine learning specialist         : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Mobile developer                    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Quality assurance engineer          : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
##  $ Systems administrator               : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Web developer                       : logi  FALSE FALSE TRUE TRUE TRUE TRUE ...
# Print stackoverflow
stackoverflow
## # A tibble: 6,991 x 22
##    Respondent Country        Salary YearsCodedJob OpenSource Hobby
##         <int> <chr>           <dbl>         <int> <lgl>      <lgl>
##  1          3 United Kingdom 113750            20 T          T    
##  2         15 United Kingdom 100000            20 F          T    
##  3         18 United States  130000            20 T          T    
##  4         19 United States   82500             3 F          T    
##  5         26 United States  175000            16 F          T    
##  6         55 Germany         64516             4 F          F    
##  7         62 India            6636             1 F          T    
##  8         71 United States   65000             1 F          T    
##  9         73 United States  120000            20 T          T    
## 10         77 United States   96283            20 T          T    
## # ... with 6,981 more rows, and 16 more variables:
## #   CompanySizeNumber <dbl>, Remote <fct>, CareerSatisfaction <int>, `Data
## #   scientist` <lgl>, `Database administrator` <lgl>, `Desktop
## #   applications developer` <lgl>, `Developer with stats/math
## #   background` <lgl>, DevOps <lgl>, `Embedded developer` <lgl>, `Graphic
## #   designer` <lgl>, `Graphics programming` <lgl>, `Machine learning
## #   specialist` <lgl>, `Mobile developer` <lgl>, `Quality assurance
## #   engineer` <lgl>, `Systems administrator` <lgl>, `Web developer` <lgl>
# First count for Remote
stackoverflow %>% 
    count(Remote, sort = TRUE)
## # A tibble: 2 x 2
##   Remote         n
##   <fct>      <int>
## 1 Not remote  6273
## 2 Remote       718
# then count for Country
stackoverflow %>% 
    count(Country, sort = TRUE)
## # A tibble: 5 x 2
##   Country            n
##   <chr>          <int>
## 1 United States   3486
## 2 United Kingdom  1270
## 3 Germany          950
## 4 India            666
## 5 Canada           619
ggplot(stackoverflow, aes(x=Remote, y=YearsCodedJob)) +
    geom_boxplot() +
    labs(x = NULL,
         y = "Years of professional coding experience") 

# Build a simple logistic regression model
simple_glm <- stackoverflow %>%
        select(-Respondent) %>%
        glm(Remote ~ .,
            family = "binomial",
            data = .)

# Print the summary of the model
summary(simple_glm)
## 
## Call:
## glm(formula = Remote ~ ., family = "binomial", data = .)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1942  -0.4971  -0.3824  -0.2867   2.9118  
## 
## Coefficients:
##                                              Estimate Std. Error z value
## (Intercept)                                -4.156e+00  2.929e-01 -14.187
## CountryGermany                             -2.034e-01  2.196e-01  -0.927
## CountryIndia                                9.574e-01  2.220e-01   4.312
## CountryUnited Kingdom                       5.599e-02  1.974e-01   0.284
## CountryUnited States                        5.990e-01  1.799e-01   3.330
## Salary                                      4.076e-06  1.589e-06   2.565
## YearsCodedJob                               7.133e-02  7.556e-03   9.440
## OpenSourceTRUE                              4.207e-01  8.555e-02   4.917
## HobbyTRUE                                   8.330e-03  9.827e-02   0.085
## CompanySizeNumber                          -6.104e-05  1.223e-05  -4.990
## CareerSatisfaction                          6.748e-02  2.664e-02   2.533
## `Data scientist`TRUE                       -1.186e-01  1.838e-01  -0.645
## `Database administrator`TRUE                2.763e-01  1.267e-01   2.181
## `Desktop applications developer`TRUE       -2.903e-01  9.842e-02  -2.950
## `Developer with stats/math background`TRUE  2.840e-02  1.359e-01   0.209
## DevOpsTRUE                                 -1.532e-01  1.292e-01  -1.185
## `Embedded developer`TRUE                   -2.777e-01  1.653e-01  -1.680
## `Graphic designer`TRUE                     -1.904e-01  2.725e-01  -0.699
## `Graphics programming`TRUE                  1.078e-01  2.312e-01   0.466
## `Machine learning specialist`TRUE          -2.289e-01  2.769e-01  -0.827
## `Mobile developer`TRUE                      2.170e-01  1.019e-01   2.130
## `Quality assurance engineer`TRUE           -2.826e-01  2.437e-01  -1.160
## `Systems administrator`TRUE                 1.462e-01  1.421e-01   1.029
## `Web developer`TRUE                         1.158e-01  9.993e-02   1.159
##                                            Pr(>|z|)    
## (Intercept)                                 < 2e-16 ***
## CountryGermany                             0.354161    
## CountryIndia                               1.62e-05 ***
## CountryUnited Kingdom                      0.776710    
## CountryUnited States                       0.000868 ***
## Salary                                     0.010314 *  
## YearsCodedJob                               < 2e-16 ***
## OpenSourceTRUE                             8.78e-07 ***
## HobbyTRUE                                  0.932444    
## CompanySizeNumber                          6.04e-07 ***
## CareerSatisfaction                         0.011323 *  
## `Data scientist`TRUE                       0.518709    
## `Database administrator`TRUE               0.029184 *  
## `Desktop applications developer`TRUE       0.003178 ** 
## `Developer with stats/math background`TRUE 0.834400    
## DevOpsTRUE                                 0.235833    
## `Embedded developer`TRUE                   0.093039 .  
## `Graphic designer`TRUE                     0.484596    
## `Graphics programming`TRUE                 0.641060    
## `Machine learning specialist`TRUE          0.408484    
## `Mobile developer`TRUE                     0.033194 *  
## `Quality assurance engineer`TRUE           0.246098    
## `Systems administrator`TRUE                0.303507    
## `Web developer`TRUE                        0.246655    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4627.8  on 6990  degrees of freedom
## Residual deviance: 4268.8  on 6967  degrees of freedom
## AIC: 4316.8
## 
## Number of Fisher Scoring iterations: 5
stack_select <- stackoverflow %>%
    select(-Respondent)

# Split the data into training and testing sets
set.seed(1234)
in_train <- caret::createDataPartition(stack_select$Remote, p=0.8, list = FALSE)
training <- stack_select[in_train,]
testing <- stack_select[-in_train,]


up_train <- caret::upSample(x = select(training, -Remote), y = training$Remote, yname = "Remote") %>%
    as_tibble()

up_train %>%
    count(Remote)
## # A tibble: 2 x 2
##   Remote         n
##   <fct>      <int>
## 1 Not remote  5019
## 2 Remote      5019
# Sub-sample the training data to 5% of its original size (keeps the random forest fit fast)
inUse <- sample(1:nrow(training), round(0.05*nrow(training)), replace=FALSE)
useTrain <- training[sort(inUse), ]

# Build a logistic regression model
stack_glm <- caret::train(Remote ~ ., method="glm", family="binomial", data = training, 
                          trControl = trainControl(method = "boot", sampling = "up")
                          )

# Print the model object 
stack_glm
## Generalized Linear Model 
## 
## 5594 samples
##   20 predictor
##    2 classes: 'Not remote', 'Remote' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 5594, 5594, 5594, 5594, 5594, 5594, ... 
## Addtional sampling using up-sampling
## 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.6568743  0.1279825
# Build a random forest model
stack_rf <- caret::train(Remote ~ ., method="rf", data = useTrain, 
                         trControl = trainControl(method = "boot", sampling="up")
                         )

# Print the model object
stack_rf
## Random Forest 
## 
## 280 samples
##  20 predictor
##   2 classes: 'Not remote', 'Remote' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 280, 280, 280, 280, 280, 280, ... 
## Addtional sampling using up-sampling
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa       
##    2    0.8626254   0.110738058
##   12    0.9038825  -0.002127159
##   23    0.8887612   0.035777206
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 12.
# Confusion matrix for logistic regression model
caret::confusionMatrix(predict(stack_glm, testing), testing$Remote)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   Not remote Remote
##   Not remote        837     53
##   Remote            417     90
##                                           
##                Accuracy : 0.6636          
##                  95% CI : (0.6381, 0.6883)
##     No Information Rate : 0.8976          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1395          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.6675          
##             Specificity : 0.6294          
##          Pos Pred Value : 0.9404          
##          Neg Pred Value : 0.1775          
##              Prevalence : 0.8976          
##          Detection Rate : 0.5991          
##    Detection Prevalence : 0.6371          
##       Balanced Accuracy : 0.6484          
##                                           
##        'Positive' Class : Not remote      
## 
# Confusion matrix for random forest model
caret::confusionMatrix(predict(stack_rf, testing), testing$Remote)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   Not remote Remote
##   Not remote       1207    125
##   Remote             47     18
##                                           
##                Accuracy : 0.8769          
##                  95% CI : (0.8585, 0.8937)
##     No Information Rate : 0.8976          
##     P-Value [Acc > NIR] : 0.9945          
##                                           
##                   Kappa : 0.1166          
##  Mcnemar's Test P-Value : 4.327e-09       
##                                           
##             Sensitivity : 0.9625          
##             Specificity : 0.1259          
##          Pos Pred Value : 0.9062          
##          Neg Pred Value : 0.2769          
##              Prevalence : 0.8976          
##          Detection Rate : 0.8640          
##    Detection Prevalence : 0.9535          
##       Balanced Accuracy : 0.5442          
##                                           
##        'Positive' Class : Not remote      
## 
# Predict values
testing_results <- testing %>%
    mutate(`Logistic regression` = predict(stack_glm, testing), `Random forest` = predict(stack_rf, testing))

## Calculate accuracy
yardstick::accuracy(testing_results, truth = Remote, estimate = `Logistic regression`)
## [1] 0.6635648
yardstick::accuracy(testing_results, truth = Remote, estimate = `Random forest`)
## [1] 0.876879
## Calculate positive predict value
yardstick::ppv(testing_results, truth = Remote, estimate = `Logistic regression`)
## [1] 0.9404494
yardstick::ppv(testing_results, truth = Remote, estimate = `Random forest`)
## [1] 0.9061562

Chapter 3 - Voting

Predicting voter turnout from survey data:

  • Survey data available from https://www.voterstudygroup.org/publications/2016-elections/data
    • Opinions about political and economic topics
    • Includes whether the voter turned out (voted), based on self-reporting, in the 2016 election
    • Data are coded as integers, requiring a data dictionary to map the questions and responses to what they mean

Vote 2016:

  • Exploratory data analysis helps build familiarity with the underlying dataset
    • Voters and non-voters differ on many of the individual survey dimensions
    • A good first step can be to start with the very simplest model, Dependent ~ .

Cross-validation is the process of sub-dividing the data into folds, with each fold used once as the validation set:

  • Allows for more accurate estimates of model performance on out-of-sample error
  • A single round of k-fold CV fits the model k times, each time training on the other k - 1 folds and validating on the held-out fold
    • Repeated CV is the process of running CV multiple times (this is particularly well suited to parallel processing)
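Under the hood, k-fold CV amounts to the loop below (a minimal base-R sketch on synthetic data; `caret::trainControl(method = "cv", number = k)` automates it):

```r
# Minimal sketch of k-fold cross-validation for a linear model.
set.seed(42)
df <- data.frame(x = runif(100))
df$y <- 2 * df$x + rnorm(100, sd = 0.1)  # synthetic linear relationship

k <- 5
folds <- sample(rep(1:k, length.out = nrow(df)))  # assign each row to a fold

rmse_per_fold <- sapply(1:k, function(i) {
    train_set <- df[folds != i, ]  # train on the other k - 1 folds
    valid_set <- df[folds == i, ]  # validate on the held-out fold
    fit <- lm(y ~ x, data = train_set)
    sqrt(mean((valid_set$y - predict(fit, valid_set))^2))
})

mean(rmse_per_fold)  # cross-validated estimate of out-of-sample RMSE
```

Because every observation is held out exactly once, the averaged fold RMSE is a less optimistic performance estimate than RMSE computed on the training data itself.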

Comparing model performance:

  • Random forest models are more flexible and can fit the training data very closely, which makes them powerful but also at greater risk of overfitting and producing poor test-set predictions
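The overfitting risk can be illustrated with a synthetic example (not the course data): a very flexible model fits the training set almost perfectly yet generalizes worse than a simpler one.

```r
# Flexible vs. simple model: training fit vs. out-of-sample fit.
set.seed(42)
train <- data.frame(x = runif(30))
train$y <- sin(2 * pi * train$x) + rnorm(30, sd = 0.3)
test <- data.frame(x = runif(200))
test$y <- sin(2 * pi * test$x) + rnorm(200, sd = 0.3)

rmse <- function(fit, d) sqrt(mean((d$y - predict(fit, d))^2))

flexible <- lm(y ~ poly(x, 20), data = train)  # high-variance model
simple   <- lm(y ~ poly(x, 5), data = train)   # lower-variance model

rmse(flexible, train)  # very small: the flexible model nearly memorizes the training set
rmse(flexible, test)   # much larger: its test-set predictions are poor
```

The same pattern shows up above in the Stack Overflow random forest, which looks strong on resampled training data while its confusion matrix on the test set reveals near-zero specificity for the minority class.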

Example code includes:

voters <- readr::read_csv("./RInputFiles/voters.csv")
## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   turnout16_2016 = col_character()
## )
## See spec(...) for full column specifications.
voters$turnout16_2016 <- factor(voters$turnout16_2016, levels=c("Did not vote", "Voted"))
str(voters, give.attr = FALSE)
## Classes 'tbl_df', 'tbl' and 'data.frame':    6692 obs. of  43 variables:
##  $ case_identifier     : int  779 2108 2597 4148 4460 5225 5903 6059 8048 13112 ...
##  $ turnout16_2016      : Factor w/ 2 levels "Did not vote",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ RIGGED_SYSTEM_1_2016: int  3 2 2 1 3 3 3 2 4 2 ...
##  $ RIGGED_SYSTEM_2_2016: int  4 1 4 4 1 3 4 3 4 3 ...
##  $ RIGGED_SYSTEM_3_2016: int  1 3 1 1 3 2 1 3 1 1 ...
##  $ RIGGED_SYSTEM_4_2016: int  4 1 4 4 1 2 1 2 3 2 ...
##  $ RIGGED_SYSTEM_5_2016: int  3 3 1 2 3 2 2 1 3 2 ...
##  $ RIGGED_SYSTEM_6_2016: int  2 2 1 1 2 3 1 2 1 2 ...
##  $ track_2016          : int  2 2 1 1 2 2 1 2 2 2 ...
##  $ persfinretro_2016   : int  2 3 3 1 2 2 2 3 2 1 ...
##  $ econtrend_2016      : int  1 3 3 1 2 2 1 3 1 1 ...
##  $ Americatrend_2016   : int  1 1 1 3 3 1 2 3 2 1 ...
##  $ futuretrend_2016    : int  4 1 1 3 4 3 1 3 1 1 ...
##  $ wealth_2016         : int  2 1 2 2 1 2 2 1 2 2 ...
##  $ values_culture_2016 : int  2 3 3 3 3 2 3 3 1 3 ...
##  $ US_respect_2016     : int  2 3 1 1 2 2 2 3 3 3 ...
##  $ trustgovt_2016      : int  3 3 3 3 3 2 3 3 3 3 ...
##  $ trust_people_2016   : int  8 2 1 1 1 2 2 1 2 1 ...
##  $ helpful_people_2016 : int  1 1 2 1 1 1 2 2 1 2 ...
##  $ fair_people_2016    : int  8 2 1 1 1 2 2 1 2 1 ...
##  $ imiss_a_2016        : int  2 1 1 1 1 2 1 1 3 1 ...
##  $ imiss_b_2016        : int  2 1 1 2 1 1 1 2 1 1 ...
##  $ imiss_c_2016        : int  1 2 2 3 1 2 2 1 4 2 ...
##  $ imiss_d_2016        : int  1 2 1 1 1 1 1 2 1 1 ...
##  $ imiss_e_2016        : int  1 1 3 1 1 3 1 2 1 1 ...
##  $ imiss_f_2016        : int  2 1 1 2 1 2 1 3 2 1 ...
##  $ imiss_g_2016        : int  1 4 3 3 3 1 3 4 2 2 ...
##  $ imiss_h_2016        : int  1 2 2 2 1 1 1 2 1 1 ...
##  $ imiss_i_2016        : int  2 2 4 4 2 1 1 3 2 1 ...
##  $ imiss_j_2016        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ imiss_k_2016        : int  1 2 1 1 2 1 1 4 2 1 ...
##  $ imiss_l_2016        : int  1 4 1 2 4 1 1 3 1 1 ...
##  $ imiss_m_2016        : int  1 2 1 2 1 1 1 1 1 1 ...
##  $ imiss_n_2016        : int  1 2 1 1 1 1 1 2 2 1 ...
##  $ imiss_o_2016        : int  2 1 1 1 1 2 1 2 2 1 ...
##  $ imiss_p_2016        : int  2 1 2 3 1 3 1 1 4 1 ...
##  $ imiss_q_2016        : int  1 1 1 2 2 1 1 4 2 1 ...
##  $ imiss_r_2016        : int  2 1 1 2 1 2 1 2 4 2 ...
##  $ imiss_s_2016        : int  1 2 1 2 2 1 1 1 1 1 ...
##  $ imiss_t_2016        : int  1 1 3 3 1 1 3 4 1 1 ...
##  $ imiss_u_2016        : int  2 2 2 2 1 3 3 1 4 2 ...
##  $ imiss_x_2016        : int  1 3 1 2 1 1 1 4 1 1 ...
##  $ imiss_y_2016        : int  1 4 2 3 1 1 1 3 2 1 ...
# Print voters
voters
## # A tibble: 6,692 x 43
##    case_identifier turnout16_2016 RIGGED_SYSTEM_1_2016 RIGGED_SYSTEM_2_20~
##              <int> <fct>                         <int>               <int>
##  1             779 Voted                             3                   4
##  2            2108 Voted                             2                   1
##  3            2597 Voted                             2                   4
##  4            4148 Voted                             1                   4
##  5            4460 Voted                             3                   1
##  6            5225 Voted                             3                   3
##  7            5903 Voted                             3                   4
##  8            6059 Voted                             2                   3
##  9            8048 Voted                             4                   4
## 10           13112 Voted                             2                   3
## # ... with 6,682 more rows, and 39 more variables:
## #   RIGGED_SYSTEM_3_2016 <int>, RIGGED_SYSTEM_4_2016 <int>,
## #   RIGGED_SYSTEM_5_2016 <int>, RIGGED_SYSTEM_6_2016 <int>,
## #   track_2016 <int>, persfinretro_2016 <int>, econtrend_2016 <int>,
## #   Americatrend_2016 <int>, futuretrend_2016 <int>, wealth_2016 <int>,
## #   values_culture_2016 <int>, US_respect_2016 <int>,
## #   trustgovt_2016 <int>, trust_people_2016 <int>,
## #   helpful_people_2016 <int>, fair_people_2016 <int>, imiss_a_2016 <int>,
## #   imiss_b_2016 <int>, imiss_c_2016 <int>, imiss_d_2016 <int>,
## #   imiss_e_2016 <int>, imiss_f_2016 <int>, imiss_g_2016 <int>,
## #   imiss_h_2016 <int>, imiss_i_2016 <int>, imiss_j_2016 <int>,
## #   imiss_k_2016 <int>, imiss_l_2016 <int>, imiss_m_2016 <int>,
## #   imiss_n_2016 <int>, imiss_o_2016 <int>, imiss_p_2016 <int>,
## #   imiss_q_2016 <int>, imiss_r_2016 <int>, imiss_s_2016 <int>,
## #   imiss_t_2016 <int>, imiss_u_2016 <int>, imiss_x_2016 <int>,
## #   imiss_y_2016 <int>
# How many people voted?
voters %>%
    count(turnout16_2016)
## # A tibble: 2 x 2
##   turnout16_2016     n
##   <fct>          <int>
## 1 Did not vote     264
## 2 Voted           6428
# How do the responses on the survey vary with voting behavior?
voters %>%
    group_by(turnout16_2016) %>%
    summarize(`Elections don't matter` = mean(RIGGED_SYSTEM_1_2016 <= 2),
              `Economy is getting better` = mean(econtrend_2016 == 1),
              `Crime is very important` = mean(imiss_a_2016 == 2))
## # A tibble: 2 x 4
##   turnout16_2016 `Elections don't ~ `Economy is gettin~ `Crime is very im~
##   <fct>                       <dbl>               <dbl>              <dbl>
## 1 Did not vote                0.553               0.163              0.292
## 2 Voted                       0.341               0.289              0.342
## Visualize difference by voter turnout
voters %>%
    ggplot(aes(econtrend_2016, ..density.., fill = turnout16_2016)) +
    geom_histogram(alpha = 0.5, position = "identity", binwidth = 1) +
    labs(title = "Overall, is the economy getting better or worse?")

# Remove the case_identifier column
voters_select <- voters %>%
        select(-case_identifier)

# Build a simple logistic regression model
simple_glm <- glm(turnout16_2016 ~ .,  family = "binomial", 
                  data = voters_select)

# Print the summary                  
summary(simple_glm)
## 
## Call:
## glm(formula = turnout16_2016 ~ ., family = "binomial", data = voters_select)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2373   0.1651   0.2214   0.3004   1.7708  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           2.457036   0.732721   3.353 0.000799 ***
## RIGGED_SYSTEM_1_2016  0.236284   0.085081   2.777 0.005484 ** 
## RIGGED_SYSTEM_2_2016  0.064749   0.089208   0.726 0.467946    
## RIGGED_SYSTEM_3_2016  0.049357   0.107352   0.460 0.645680    
## RIGGED_SYSTEM_4_2016 -0.074694   0.087583  -0.853 0.393749    
## RIGGED_SYSTEM_5_2016  0.190252   0.096454   1.972 0.048556 *  
## RIGGED_SYSTEM_6_2016 -0.005881   0.101381  -0.058 0.953740    
## track_2016            0.241075   0.121467   1.985 0.047178 *  
## persfinretro_2016    -0.040229   0.106714  -0.377 0.706191    
## econtrend_2016       -0.295370   0.087224  -3.386 0.000708 ***
## Americatrend_2016    -0.105213   0.080845  -1.301 0.193116    
## futuretrend_2016      0.210568   0.071201   2.957 0.003103 ** 
## wealth_2016          -0.069405   0.026344  -2.635 0.008424 ** 
## values_culture_2016  -0.041402   0.038670  -1.071 0.284332    
## US_respect_2016      -0.068200   0.043785  -1.558 0.119322    
## trustgovt_2016        0.315354   0.166655   1.892 0.058456 .  
## trust_people_2016     0.040423   0.041518   0.974 0.330236    
## helpful_people_2016  -0.037513   0.035353  -1.061 0.288643    
## fair_people_2016     -0.017081   0.030170  -0.566 0.571294    
## imiss_a_2016          0.397121   0.138987   2.857 0.004273 ** 
## imiss_b_2016         -0.250803   0.155454  -1.613 0.106667    
## imiss_c_2016          0.017536   0.090647   0.193 0.846606    
## imiss_d_2016          0.043510   0.122118   0.356 0.721619    
## imiss_e_2016         -0.095552   0.078603  -1.216 0.224126    
## imiss_f_2016         -0.323280   0.105432  -3.066 0.002168 ** 
## imiss_g_2016         -0.332034   0.078673  -4.220 2.44e-05 ***
## imiss_h_2016         -0.157298   0.107111  -1.469 0.141954    
## imiss_i_2016          0.088695   0.091467   0.970 0.332196    
## imiss_j_2016          0.060271   0.138429   0.435 0.663280    
## imiss_k_2016         -0.181030   0.082726  -2.188 0.028646 *  
## imiss_l_2016          0.274689   0.106781   2.572 0.010098 *  
## imiss_m_2016         -0.124269   0.147888  -0.840 0.400746    
## imiss_n_2016         -0.441612   0.090040  -4.905 9.36e-07 ***
## imiss_o_2016          0.198635   0.143160   1.388 0.165286    
## imiss_p_2016          0.102987   0.105669   0.975 0.329751    
## imiss_q_2016          0.244567   0.119093   2.054 0.040017 *  
## imiss_r_2016          0.198839   0.121969   1.630 0.103050    
## imiss_s_2016         -0.067310   0.134465  -0.501 0.616666    
## imiss_t_2016         -0.116757   0.068143  -1.713 0.086639 .  
## imiss_u_2016          0.022902   0.097312   0.235 0.813939    
## imiss_x_2016         -0.017789   0.097349  -0.183 0.855003    
## imiss_y_2016          0.150205   0.094536   1.589 0.112092    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2224.3  on 6691  degrees of freedom
## Residual deviance: 2004.4  on 6650  degrees of freedom
## AIC: 2088.4
## 
## Number of Fisher Scoring iterations: 6
# Split data into training and testing sets
set.seed(1234)
in_train <- caret::createDataPartition(voters_select$turnout16_2016, p = 0.8, list = FALSE)
training <- voters_select[in_train, ]
testing <- voters_select[-in_train, ]


# Perform logistic regression with upsampling and no resampling
vote_glm_1 <- caret::train(turnout16_2016 ~ ., method = "glm", family = "binomial", data = training,
                           trControl = trainControl(method = "none", sampling = "up")
                           )

# Print vote_glm
vote_glm_1
## Generalized Linear Model 
## 
## 5355 samples
##   41 predictor
##    2 classes: 'Did not vote', 'Voted' 
## 
## No pre-processing
## Resampling: None 
## Addtional sampling using up-sampling
useSmall <- sort(sample(1:nrow(training), round(0.1*nrow(training)), replace=FALSE))
trainSmall <- training[useSmall, ]

# Logistic regression
vote_glm <- caret::train(turnout16_2016 ~ ., method = "glm", family = "binomial", data = trainSmall,
                         trControl = trainControl(method = "repeatedcv", repeats = 2, sampling = "up")
                         )
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

# Print vote_glm
vote_glm
## Generalized Linear Model 
## 
## 536 samples
##  41 predictor
##   2 classes: 'Did not vote', 'Voted' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 2 times) 
## Summary of sample sizes: 482, 482, 482, 483, 482, 483, ... 
## Addtional sampling using up-sampling
## 
## Resampling results:
## 
##   Accuracy   Kappa     
##   0.8713138  0.04298445
# Random forest
vote_rf <- caret::train(turnout16_2016 ~ ., method = "rf", data = trainSmall,
                        trControl = trainControl(method="repeatedcv", repeats=2, sampling = "up")
                        )

# Print vote_rf
vote_rf
## Random Forest 
## 
## 536 samples
##  41 predictor
##   2 classes: 'Did not vote', 'Voted' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 2 times) 
## Summary of sample sizes: 483, 482, 483, 482, 483, 483, ... 
## Addtional sampling using up-sampling
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa       
##    2    0.9674179  -0.001265823
##   21    0.9627184  -0.006073829
##   41    0.9542628   0.019107234
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
# Confusion matrix for logistic regression model on training data
caret::confusionMatrix(predict(vote_glm, trainSmall), trainSmall$turnout16_2016)
## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     Did not vote Voted
##   Did not vote           17    48
##   Voted                   0   471
##                                          
##                Accuracy : 0.9104         
##                  95% CI : (0.883, 0.9332)
##     No Information Rate : 0.9683         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.3836         
##  Mcnemar's Test P-Value : 1.17e-11       
##                                          
##             Sensitivity : 1.00000        
##             Specificity : 0.90751        
##          Pos Pred Value : 0.26154        
##          Neg Pred Value : 1.00000        
##              Prevalence : 0.03172        
##          Detection Rate : 0.03172        
##    Detection Prevalence : 0.12127        
##       Balanced Accuracy : 0.95376        
##                                          
##        'Positive' Class : Did not vote   
## 
# Confusion matrix for random forest model on training data
caret::confusionMatrix(predict(vote_rf, trainSmall), trainSmall$turnout16_2016)
## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     Did not vote Voted
##   Did not vote           17     0
##   Voted                   0   519
##                                       
##                Accuracy : 1           
##                  95% CI : (0.9931, 1) 
##     No Information Rate : 0.9683      
##     P-Value [Acc > NIR] : 3.143e-08   
##                                       
##                   Kappa : 1           
##  Mcnemar's Test P-Value : NA          
##                                       
##             Sensitivity : 1.00000     
##             Specificity : 1.00000     
##          Pos Pred Value : 1.00000     
##          Neg Pred Value : 1.00000     
##              Prevalence : 0.03172     
##          Detection Rate : 0.03172     
##    Detection Prevalence : 0.03172     
##       Balanced Accuracy : 1.00000     
##                                       
##        'Positive' Class : Did not vote
## 
# Confusion matrix for logistic regression model on testing data
caret::confusionMatrix(predict(vote_glm, testing), testing$turnout16_2016)
## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     Did not vote Voted
##   Did not vote           14   166
##   Voted                  38  1119
##                                          
##                Accuracy : 0.8474         
##                  95% CI : (0.827, 0.8663)
##     No Information Rate : 0.9611         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.0642         
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.26923        
##             Specificity : 0.87082        
##          Pos Pred Value : 0.07778        
##          Neg Pred Value : 0.96716        
##              Prevalence : 0.03889        
##          Detection Rate : 0.01047        
##    Detection Prevalence : 0.13463        
##       Balanced Accuracy : 0.57002        
##                                          
##        'Positive' Class : Did not vote   
## 
# Confusion matrix for random forest model on testing data
caret::confusionMatrix(predict(vote_rf, testing), testing$turnout16_2016)
## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     Did not vote Voted
##   Did not vote            1     1
##   Voted                  51  1284
##                                           
##                Accuracy : 0.9611          
##                  95% CI : (0.9493, 0.9708)
##     No Information Rate : 0.9611          
##     P-Value [Acc > NIR] : 0.5368          
##                                           
##                   Kappa : 0.0343          
##  Mcnemar's Test P-Value : 1.083e-11       
##                                           
##             Sensitivity : 0.0192308       
##             Specificity : 0.9992218       
##          Pos Pred Value : 0.5000000       
##          Neg Pred Value : 0.9617978       
##              Prevalence : 0.0388930       
##          Detection Rate : 0.0007479       
##    Detection Prevalence : 0.0014959       
##       Balanced Accuracy : 0.5092263       
##                                           
##        'Positive' Class : Did not vote    
## 

Chapter 4 - Nuns

Catholic sisters survey from 1967 - https://curate.nd.edu/show/0r967368551 with codebook at https://curate.nd.edu/downloads/0v838051f6x

  • Responses from 130,000 sisters in ~400 congregations
  • There was significant change occurring during this time period, both in society at large and within the community of nuns
  • Age has been binned in 10-year groups and recoded as a numeric at the top of each range (so 20 means 11-20, 30 means 21-30, and so on)
  • Historical dataset, centered in the context of nuns in 1967
  • Good first step is to tidy the data, so that it is easier for exploratory data analysis
    • sisters67 %>% select(-sister) %>% gather(key, value, -age)

Exploratory data analysis with tidy data:

  • Easy to see levels of agreement (overall) using dplyr::count()
  • Agreement with specific questions by age
    • tidy_sisters %>% filter(key %in% paste0("v", 153:170)) %>% group_by(key, value) %>% summarise(age = mean(age)) %>% ggplot(aes(value, age, color = key)) + geom_line(alpha = 0.5, size = 1.5) + geom_point(size = 2) + facet_wrap(~key)
    • Can use the mix of responses to make estimates about the ages of the nuns
  • Data will be split into training, validation, and test sets
    • The validation set will be used for model selection

Predicting age with supervised learning:

  • "rpart" - builds a tree-based (CART) model
  • "xgbLinear" - extreme gradient boosting
  • "gbm" - gradient boosted ensembles
  • Validation datasets are useful for assessing hyper-parameters and model choices, leaving the test dataset pure for a final out-of-sample error estimate
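The train/validation/test idea above can be sketched in base R. This is a simplified illustration with made-up sizes; the course itself uses caret::createDataPartition for the split.

```r
# Simplified 60/20/20 train/validation/test split in base R
# (a sketch; the course uses caret::createDataPartition instead)
set.seed(1234)
n   <- 100                         # pretend the data has 100 rows
idx <- sample(n)                   # shuffle the row indices once
train_idx      <- idx[1:60]        # 60% for training
validation_idx <- idx[61:80]       # 20% for model selection
test_idx       <- idx[81:100]      # 20% held out for the final error estimate
```

Because the indices are drawn from one shuffled vector, the three sets are guaranteed to be disjoint and to cover every row exactly once.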

Wrap up:

  • Train-Validation-Test to select the best models, tune the parameters, and estimate the out-of-sample error rates
  • Dealing with class imbalances; improving performance with resamples (bootstraps, cross-validation, etc.)
  • Hyper-parameter tuning can be valuable, but the time investment in other areas can often generate a greater return
  • Gradient boosting and random forests tend to perform very well, but there is always value in trying out multiple models
    • Start with EDA and begin with a very simple model
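The class-imbalance point can be made concrete with a minimal base-R up-sampling sketch. caret performs this internally when trainControl(sampling = "up") is set; the outcome vector here is invented for illustration.

```r
# Up-sampling sketch: resample minority-class rows with replacement
# until both classes are equally represented (toy data, not the course's)
set.seed(42)
outcome  <- c(rep("Voted", 95), rep("Did not vote", 5))
minority <- which(outcome == "Did not vote")
extra    <- sample(minority, sum(outcome == "Voted") - length(minority),
                   replace = TRUE)
balanced_idx <- c(seq_along(outcome), extra)   # original rows + duplicates
table(outcome[balanced_idx])                   # both classes now have 95 rows
```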

Example code includes:

sisters67 <- readr::read_csv("./RInputFiles/sisters.csv")
## Parsed with column specification:
## cols(
##   .default = col_integer()
## )
## See spec(...) for full column specifications.
str(sisters67, give.attr = FALSE)
## Classes 'tbl_df', 'tbl' and 'data.frame':    19278 obs. of  67 variables:
##  $ age   : int  40 30 40 30 40 30 70 30 60 80 ...
##  $ sister: int  11545 16953 73323 75339 36303 95318 22474 114526 20707 91062 ...
##  $ v116  : int  5 4 2 4 4 2 4 4 4 5 ...
##  $ v117  : int  2 1 2 3 2 4 5 1 5 1 ...
##  $ v118  : int  2 4 5 3 3 5 5 4 5 2 ...
##  $ v119  : int  2 4 5 4 5 5 5 5 4 1 ...
##  $ v120  : int  4 1 3 3 1 1 5 1 5 2 ...
##  $ v121  : int  4 1 4 4 4 5 4 1 5 5 ...
##  $ v122  : int  4 1 2 2 4 1 1 2 2 1 ...
##  $ v123  : int  5 5 3 4 4 3 1 5 2 5 ...
##  $ v124  : int  1 1 5 2 3 1 5 3 5 4 ...
##  $ v125  : int  4 2 5 3 4 2 5 2 5 5 ...
##  $ v126  : int  2 1 1 3 1 1 5 1 5 2 ...
##  $ v127  : int  1 4 5 2 2 1 1 1 4 1 ...
##  $ v128  : int  2 1 4 3 4 4 5 2 5 3 ...
##  $ v129  : int  4 4 5 4 5 4 5 5 4 1 ...
##  $ v130  : int  2 4 4 3 3 1 5 1 5 4 ...
##  $ v131  : int  1 2 2 3 5 5 2 3 3 2 ...
##  $ v132  : int  5 5 5 4 5 2 2 5 4 5 ...
##  $ v133  : int  2 4 5 3 5 1 4 2 4 4 ...
##  $ v134  : int  2 4 4 3 4 4 1 4 4 2 ...
##  $ v135  : int  5 5 4 3 5 4 1 5 5 2 ...
##  $ v136  : int  1 4 4 2 4 4 1 4 4 2 ...
##  $ v137  : int  1 1 1 1 1 1 2 1 2 4 ...
##  $ v138  : int  2 1 3 1 3 1 4 1 2 1 ...
##  $ v139  : int  3 1 3 3 1 1 4 1 5 4 ...
##  $ v140  : int  1 2 1 2 4 4 5 2 5 2 ...
##  $ v141  : int  5 5 4 3 3 3 4 5 4 4 ...
##  $ v142  : int  1 1 2 2 2 1 2 1 4 3 ...
##  $ v143  : int  2 1 5 4 4 5 4 5 4 1 ...
##  $ v144  : int  1 2 1 2 1 1 3 1 4 2 ...
##  $ v145  : int  4 4 5 3 4 1 5 2 5 4 ...
##  $ v146  : int  4 4 5 4 5 5 4 5 2 4 ...
##  $ v147  : int  2 2 1 2 3 1 2 1 2 2 ...
##  $ v148  : int  1 1 4 1 1 4 4 1 5 1 ...
##  $ v149  : int  4 2 4 2 1 1 2 1 5 4 ...
##  $ v150  : int  2 1 2 3 1 4 2 1 5 2 ...
##  $ v151  : int  4 1 5 4 4 1 5 1 4 3 ...
##  $ v152  : int  2 1 1 3 1 1 2 1 4 4 ...
##  $ v153  : int  5 5 5 5 5 5 5 5 5 2 ...
##  $ v154  : int  1 1 4 2 1 3 5 1 4 2 ...
##  $ v155  : int  5 4 4 3 5 5 4 5 4 4 ...
##  $ v156  : int  1 1 2 2 1 1 5 1 5 2 ...
##  $ v157  : int  4 1 4 3 1 1 2 1 3 4 ...
##  $ v158  : int  4 4 5 2 5 5 2 5 5 4 ...
##  $ v159  : int  1 4 4 1 2 1 4 1 4 2 ...
##  $ v160  : int  2 5 5 4 4 4 5 5 5 4 ...
##  $ v161  : int  2 4 3 3 1 1 4 1 2 4 ...
##  $ v162  : int  5 4 5 4 4 4 5 5 5 4 ...
##  $ v163  : int  2 1 2 3 1 1 2 1 4 1 ...
##  $ v164  : int  4 1 5 2 4 1 5 1 5 4 ...
##  $ v165  : int  2 1 3 2 1 1 1 1 2 2 ...
##  $ v166  : int  2 4 5 2 1 1 5 2 5 4 ...
##  $ v167  : int  2 4 5 3 4 4 2 4 5 2 ...
##  $ v168  : int  5 5 5 4 5 5 5 5 4 5 ...
##  $ v169  : int  1 1 1 2 1 1 5 1 4 4 ...
##  $ v170  : int  5 1 4 3 2 4 4 1 2 4 ...
##  $ v171  : int  5 5 5 4 1 2 5 5 5 5 ...
##  $ v172  : int  2 1 5 5 2 2 5 1 5 3 ...
##  $ v173  : int  2 2 4 2 2 1 4 1 1 4 ...
##  $ v174  : int  2 4 2 3 4 1 5 5 4 2 ...
##  $ v175  : int  1 1 4 2 2 1 2 1 5 4 ...
##  $ v176  : int  4 4 4 3 1 4 4 3 3 2 ...
##  $ v177  : int  4 4 5 3 4 2 4 4 4 4 ...
##  $ v178  : int  4 1 4 2 1 1 2 1 4 4 ...
##  $ v179  : int  4 4 4 3 4 2 4 4 5 4 ...
##  $ v180  : int  4 2 5 3 3 1 1 1 1 2 ...
# View sisters67
glimpse(sisters67)
## Observations: 19,278
## Variables: 67
## $ age    <int> 40, 30, 40, 30, 40, 30, 70, 30, 60, 80, 90, 40, 60, 80,...
## $ sister <int> 11545, 16953, 73323, 75339, 36303, 95318, 22474, 114526...
## $ v116   <int> 5, 4, 2, 4, 4, 2, 4, 4, 4, 5, 2, 5, 4, 4, 3, 4, 5, 3, 4...
## $ v117   <int> 2, 1, 2, 3, 2, 4, 5, 1, 5, 1, 3, 2, 5, 4, 1, 1, 1, 1, 2...
## $ v118   <int> 2, 4, 5, 3, 3, 5, 5, 4, 5, 2, 4, 4, 4, 5, 2, 4, 4, 4, 2...
## $ v119   <int> 2, 4, 5, 4, 5, 5, 5, 5, 4, 1, 4, 5, 3, 4, 5, 5, 5, 5, 4...
## $ v120   <int> 4, 1, 3, 3, 1, 1, 5, 1, 5, 2, 3, 1, 5, 4, 4, 1, 1, 1, 2...
## $ v121   <int> 4, 1, 4, 4, 4, 5, 4, 1, 5, 5, 4, 1, 3, 4, 3, 2, 5, 3, 3...
## $ v122   <int> 4, 1, 2, 2, 4, 1, 1, 2, 2, 1, 4, 5, 1, 2, 4, 2, 1, 4, 2...
## $ v123   <int> 5, 5, 3, 4, 4, 3, 1, 5, 2, 5, 3, 4, 3, 4, 5, 5, 4, 5, 4...
## $ v124   <int> 1, 1, 5, 2, 3, 1, 5, 3, 5, 4, 4, 1, 3, 2, 1, 1, 3, 2, 2...
## $ v125   <int> 4, 2, 5, 3, 4, 2, 5, 2, 5, 5, 5, 5, 5, 5, 1, 1, 5, 1, 2...
## $ v126   <int> 2, 1, 1, 3, 1, 1, 5, 1, 5, 2, 4, 1, 5, 1, 3, 1, 5, 1, 2...
## $ v127   <int> 1, 4, 5, 2, 2, 1, 1, 1, 4, 1, 4, 1, 3, 5, 2, 1, 1, 2, 2...
## $ v128   <int> 2, 1, 4, 3, 4, 4, 5, 2, 5, 3, 2, 5, 5, 4, 1, 1, 4, 1, 1...
## $ v129   <int> 4, 4, 5, 4, 5, 4, 5, 5, 4, 1, 5, 1, 5, 5, 5, 1, 5, 5, 5...
## $ v130   <int> 2, 4, 4, 3, 3, 1, 5, 1, 5, 4, 5, 5, 1, 4, 1, 1, 4, 3, 2...
## $ v131   <int> 1, 2, 2, 3, 5, 5, 2, 3, 3, 2, 3, 4, 3, 4, 2, 4, 3, 4, 4...
## $ v132   <int> 5, 5, 5, 4, 5, 2, 2, 5, 4, 5, 4, 5, 5, 5, 4, 5, 3, 5, 5...
## $ v133   <int> 2, 4, 5, 3, 5, 1, 4, 2, 4, 4, 5, 1, 1, 1, 2, 4, 3, 1, 2...
## $ v134   <int> 2, 4, 4, 3, 4, 4, 1, 4, 4, 2, 3, 5, 2, 4, 4, 4, 3, 3, 4...
## $ v135   <int> 5, 5, 4, 3, 5, 4, 1, 5, 5, 2, 4, 5, 3, 5, 2, 5, 3, 5, 5...
## $ v136   <int> 1, 4, 4, 2, 4, 4, 1, 4, 4, 2, 4, 4, 4, 4, 2, 2, 4, 2, 2...
## $ v137   <int> 1, 1, 1, 1, 1, 1, 2, 1, 2, 4, 5, 1, 3, 1, 1, 1, 1, 1, 1...
## $ v138   <int> 2, 1, 3, 1, 3, 1, 4, 1, 2, 1, 3, 2, 1, 3, 2, 1, 4, 3, 1...
## $ v139   <int> 3, 1, 3, 3, 1, 1, 4, 1, 5, 4, 4, 1, 2, 4, 1, 1, 2, 1, 1...
## $ v140   <int> 1, 2, 1, 2, 4, 4, 5, 2, 5, 2, 2, 1, 5, 2, 1, 4, 1, 2, 2...
## $ v141   <int> 5, 5, 4, 3, 3, 3, 4, 5, 4, 4, 5, 5, 5, 5, 5, 4, 4, 3, 5...
## $ v142   <int> 1, 1, 2, 2, 2, 1, 2, 1, 4, 3, 4, 2, 2, 3, 2, 2, 1, 3, 1...
## $ v143   <int> 2, 1, 5, 4, 4, 5, 4, 5, 4, 1, 4, 5, 5, 2, 5, 5, 3, 3, 5...
## $ v144   <int> 1, 2, 1, 2, 1, 1, 3, 1, 4, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1...
## $ v145   <int> 4, 4, 5, 3, 4, 1, 5, 2, 5, 4, 4, 1, 4, 5, 2, 2, 1, 2, 2...
## $ v146   <int> 4, 4, 5, 4, 5, 5, 4, 5, 2, 4, 4, 4, 4, 4, 2, 5, 3, 5, 4...
## $ v147   <int> 2, 2, 1, 2, 3, 1, 2, 1, 2, 2, 3, 1, 2, 1, 2, 2, 3, 2, 4...
## $ v148   <int> 1, 1, 4, 1, 1, 4, 4, 1, 5, 1, 4, 1, 3, 1, 1, 1, 2, 1, 1...
## $ v149   <int> 4, 2, 4, 2, 1, 1, 2, 1, 5, 4, 4, 2, 5, 1, 1, 2, 5, 2, 1...
## $ v150   <int> 2, 1, 2, 3, 1, 4, 2, 1, 5, 2, 5, 2, 2, 2, 3, 1, 5, 1, 1...
## $ v151   <int> 4, 1, 5, 4, 4, 1, 5, 1, 4, 3, 4, 1, 2, 5, 2, 4, 5, 1, 4...
## $ v152   <int> 2, 1, 1, 3, 1, 1, 2, 1, 4, 4, 4, 1, 4, 3, 4, 1, 1, 1, 2...
## $ v153   <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 2, 3, 2, 5, 5, 4, 5, 5, 5, 4...
## $ v154   <int> 1, 1, 4, 2, 1, 3, 5, 1, 4, 2, 5, 1, 5, 5, 1, 1, 4, 1, 1...
## $ v155   <int> 5, 4, 4, 3, 5, 5, 4, 5, 4, 4, 3, 4, 3, 5, 2, 5, 5, 5, 1...
## $ v156   <int> 1, 1, 2, 2, 1, 1, 5, 1, 5, 2, 5, 1, 1, 4, 1, 1, 3, 1, 1...
## $ v157   <int> 4, 1, 4, 3, 1, 1, 2, 1, 3, 4, 2, 1, 2, 3, 3, 2, 3, 1, 1...
## $ v158   <int> 4, 4, 5, 2, 5, 5, 2, 5, 5, 4, 4, 5, 4, 2, 5, 4, 4, 3, 4...
## $ v159   <int> 1, 4, 4, 1, 2, 1, 4, 1, 4, 2, 4, 1, 3, 2, 1, 1, 2, 1, 1...
## $ v160   <int> 2, 5, 5, 4, 4, 4, 5, 5, 5, 4, 5, 2, 5, 5, 5, 4, 5, 2, 4...
## $ v161   <int> 2, 4, 3, 3, 1, 1, 4, 1, 2, 4, 5, 1, 4, 5, 1, 1, 3, 1, 1...
## $ v162   <int> 5, 4, 5, 4, 4, 4, 5, 5, 5, 4, 4, 5, 5, 5, 3, 4, 5, 5, 5...
## $ v163   <int> 2, 1, 2, 3, 1, 1, 2, 1, 4, 1, 4, 1, 1, 1, 1, 2, 3, 3, 1...
## $ v164   <int> 4, 1, 5, 2, 4, 1, 5, 1, 5, 4, 4, 1, 1, 5, 1, 4, 3, 1, 4...
## $ v165   <int> 2, 1, 3, 2, 1, 1, 1, 1, 2, 2, 5, 2, 1, 5, 2, 3, 3, 2, 4...
## $ v166   <int> 2, 4, 5, 2, 1, 1, 5, 2, 5, 4, 5, 1, 2, 4, 2, 4, 5, 3, 4...
## $ v167   <int> 2, 4, 5, 3, 4, 4, 2, 4, 5, 2, 4, 4, 2, 5, 2, 4, 3, 2, 4...
## $ v168   <int> 5, 5, 5, 4, 5, 5, 5, 5, 4, 5, 5, 4, 5, 5, 3, 4, 3, 4, 5...
## $ v169   <int> 1, 1, 1, 2, 1, 1, 5, 1, 4, 4, 5, 1, 1, 1, 1, 1, 1, 1, 1...
## $ v170   <int> 5, 1, 4, 3, 2, 4, 4, 1, 2, 4, 3, 3, 3, 5, 4, 3, 5, 3, 4...
## $ v171   <int> 5, 5, 5, 4, 1, 2, 5, 5, 5, 5, 5, 1, 5, 5, 3, 4, 5, 4, 5...
## $ v172   <int> 2, 1, 5, 5, 2, 2, 5, 1, 5, 3, 5, 1, 5, 5, 2, 2, 3, 5, 2...
## $ v173   <int> 2, 2, 4, 2, 2, 1, 4, 1, 1, 4, 4, 1, 2, 5, 4, 4, 3, 1, 4...
## $ v174   <int> 2, 4, 2, 3, 4, 1, 5, 5, 4, 2, 4, 5, 3, 4, 2, 4, 3, 3, 4...
## $ v175   <int> 1, 1, 4, 2, 2, 1, 2, 1, 5, 4, 3, 1, 2, 4, 1, 4, 3, 1, 1...
## $ v176   <int> 4, 4, 4, 3, 1, 4, 4, 3, 3, 2, 5, 5, 3, 5, 3, 1, 3, 3, 2...
## $ v177   <int> 4, 4, 5, 3, 4, 2, 4, 4, 4, 4, 5, 2, 5, 5, 3, 2, 5, 4, 4...
## $ v178   <int> 4, 1, 4, 2, 1, 1, 2, 1, 4, 4, 4, 1, 2, 4, 1, 2, 3, 1, 2...
## $ v179   <int> 4, 4, 4, 3, 4, 2, 4, 4, 5, 4, 5, 2, 5, 5, 3, 1, 5, 3, 4...
## $ v180   <int> 4, 2, 5, 3, 3, 1, 1, 1, 1, 2, 4, 2, 2, 5, 1, 1, 3, 3, 2...
# Plot the histogram
ggplot(sisters67, aes(x = age)) +
    geom_histogram(binwidth = 10)

# Tidy the data set
tidy_sisters <- sisters67 %>%
    select(-sister) %>%
    gather(key, value, -age)

# Print the structure of tidy_sisters
glimpse(tidy_sisters)
## Observations: 1,253,070
## Variables: 3
## $ age   <int> 40, 30, 40, 30, 40, 30, 70, 30, 60, 80, 90, 40, 60, 80, ...
## $ key   <chr> "v116", "v116", "v116", "v116", "v116", "v116", "v116", ...
## $ value <int> 5, 4, 2, 4, 4, 2, 4, 4, 4, 5, 2, 5, 4, 4, 3, 4, 5, 3, 4,...
# Overall agreement with all questions varied by age
tidy_sisters %>%
    group_by(age) %>%
    summarize(value = mean(value, na.rm = TRUE))
## # A tibble: 9 x 2
##     age value
##   <int> <dbl>
## 1    20  2.82
## 2    30  2.81
## 3    40  2.82
## 4    50  2.95
## 5    60  3.10
## 6    70  3.25
## 7    80  3.39
## 8    90  3.55
## 9   100  3.93
# Number of respondents agreed or disagreed overall
tidy_sisters %>%
    count(value)
## # A tibble: 5 x 2
##   value      n
##   <int>  <int>
## 1     1 326386
## 2     2 211534
## 3     3 160961
## 4     4 277062
## 5     5 277127
# Visualize agreement with age
tidy_sisters %>%
    filter(key %in% paste0("v", 153:170)) %>%
    group_by(key, value) %>%
    summarize(age = mean(age, na.rm = TRUE)) %>%
    ggplot(aes(value, age, color = key)) +
    geom_line(show.legend = FALSE) +
    facet_wrap(~key, nrow = 3)

# Remove the sister column
sisters_select <- sisters67 %>% 
    select(-sister)

# Build a simple linear regression model
simple_lm <- lm(age ~ ., 
                data = sisters_select)

# Print the summary of the model
summary(simple_lm)
## 
## Call:
## lm(formula = age ~ ., data = sisters_select)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.663  -9.586  -1.207   8.991  53.286 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 27.59542    1.07173  25.748  < 2e-16 ***
## v116        -0.69014    0.07727  -8.931  < 2e-16 ***
## v117        -0.15914    0.08869  -1.794 0.072786 .  
## v118        -0.74668    0.08473  -8.813  < 2e-16 ***
## v119        -0.35314    0.08321  -4.244 2.21e-05 ***
## v120        -0.13875    0.07513  -1.847 0.064813 .  
## v121         0.04265    0.07794   0.547 0.584247    
## v122         0.05237    0.08086   0.648 0.517208    
## v123        -0.96372    0.09061 -10.636  < 2e-16 ***
## v124         0.44543    0.08681   5.131 2.91e-07 ***
## v125         0.50420    0.07425   6.791 1.15e-11 ***
## v126         0.44358    0.08579   5.170 2.36e-07 ***
## v127        -0.04781    0.07915  -0.604 0.545810    
## v128         0.04459    0.07595   0.587 0.557162    
## v129         0.03044    0.07881   0.386 0.699351    
## v130         0.51028    0.08064   6.328 2.54e-10 ***
## v131        -0.54431    0.08417  -6.467 1.02e-10 ***
## v132        -0.02527    0.09337  -0.271 0.786703    
## v133        -0.67041    0.07563  -8.864  < 2e-16 ***
## v134        -0.12144    0.09060  -1.340 0.180130    
## v135         0.45773    0.10886   4.205 2.63e-05 ***
## v136        -0.08790    0.07438  -1.182 0.237293    
## v137         0.74412    0.10230   7.274 3.63e-13 ***
## v138         0.31534    0.10601   2.974 0.002939 ** 
## v139         1.36585    0.10514  12.990  < 2e-16 ***
## v140        -0.73675    0.07371  -9.995  < 2e-16 ***
## v141         0.50515    0.09355   5.400 6.75e-08 ***
## v142        -0.22168    0.08357  -2.653 0.007992 ** 
## v143         0.08320    0.08375   0.993 0.320536    
## v144         1.09413    0.10870  10.066  < 2e-16 ***
## v145        -0.46821    0.08217  -5.698 1.23e-08 ***
## v146        -0.50063    0.08094  -6.185 6.32e-10 ***
## v147        -0.28499    0.09800  -2.908 0.003640 ** 
## v148         1.47288    0.09165  16.070  < 2e-16 ***
## v149        -0.29683    0.08562  -3.467 0.000528 ***
## v150        -0.33882    0.08396  -4.036 5.46e-05 ***
## v151         0.79497    0.08901   8.931  < 2e-16 ***
## v152        -0.02073    0.08179  -0.253 0.799906    
## v153        -0.53982    0.09110  -5.925 3.17e-09 ***
## v154         0.98930    0.07843  12.614  < 2e-16 ***
## v155         0.96066    0.09897   9.706  < 2e-16 ***
## v156         1.07836    0.09176  11.752  < 2e-16 ***
## v157         0.07577    0.08249   0.918 0.358378    
## v158         0.05330    0.08419   0.633 0.526696    
## v159        -0.28846    0.08321  -3.467 0.000528 ***
## v160         0.28066    0.08559   3.279 0.001043 ** 
## v161         0.67235    0.08759   7.677 1.71e-14 ***
## v162        -0.29388    0.10063  -2.920 0.003501 ** 
## v163        -1.38883    0.09242 -15.027  < 2e-16 ***
## v164        -0.44411    0.07017  -6.329 2.52e-10 ***
## v165        -0.49356    0.09033  -5.464 4.71e-08 ***
## v166         0.24787    0.08329   2.976 0.002924 ** 
## v167        -0.06290    0.08185  -0.768 0.442236    
## v168         0.33712    0.09425   3.577 0.000349 ***
## v169         1.44938    0.08634  16.786  < 2e-16 ***
## v170         1.01626    0.09083  11.189  < 2e-16 ***
## v171         0.90086    0.08359  10.777  < 2e-16 ***
## v172         0.07702    0.07176   1.073 0.283135    
## v173         0.76461    0.06936  11.025  < 2e-16 ***
## v174         0.22074    0.07851   2.812 0.004934 ** 
## v175         0.18369    0.07930   2.316 0.020553 *  
## v176         1.03334    0.08996  11.487  < 2e-16 ***
## v177        -0.07908    0.09643  -0.820 0.412153    
## v178        -0.08005    0.08250  -0.970 0.331906    
## v179         0.29778    0.09251   3.219 0.001289 ** 
## v180         0.11524    0.08566   1.345 0.178538    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.13 on 19212 degrees of freedom
## Multiple R-squared:  0.3332, Adjusted R-squared:  0.3309 
## F-statistic: 147.7 on 65 and 19212 DF,  p-value: < 2.2e-16
# Split the data into training and validation/test sets
set.seed(1234)
in_train <- caret::createDataPartition(sisters_select$age, p = 0.6, list = FALSE)
training <- sisters_select[in_train, ]
validation_test <- sisters_select[-in_train, ]

# Split the validation and test sets
set.seed(1234)
in_test <- caret::createDataPartition(validation_test$age, p = 0.5, list = FALSE)
testing <- validation_test[in_test, ]
validation <- validation_test[-in_test, ]


# Fit a CART model
sisters_cart <- caret::train(age ~ ., method = "rpart", data = training)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
# Print the CART model
sisters_cart
## CART 
## 
## 11569 samples
##    65 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 11569, 11569, 11569, 11569, 11569, 11569, ... 
## Resampling results across tuning parameters:
## 
##   cp          RMSE      Rsquared   MAE     
##   0.02304336  14.61359  0.1724244  12.00686
##   0.04935303  14.89119  0.1403800  12.41303
##   0.11481230  15.54485  0.1046127  13.19914
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.02304336.
inSmall <- sample(1:nrow(training), 500, replace=FALSE)
smallSisters <- training[sort(inSmall), ]

sisters_xgb <- caret::train(age ~ ., method = "xgbTree", data = smallSisters)
sisters_gbm <- caret::train(age ~ ., method = "gbm", data = smallSisters, verbose=FALSE)

# Make predictions on the three models
modeling_results <- validation %>%
    mutate(CART = predict(sisters_cart, validation),
           XGB = predict(sisters_xgb, validation),
           GBM = predict(sisters_gbm, validation))

# View the predictions
modeling_results %>% 
    select(CART, XGB, GBM)
## # A tibble: 3,854 x 3
##     CART   XGB   GBM
##    <dbl> <dbl> <dbl>
##  1  49.5  46.2  44.3
##  2  49.5  61.1  56.5
##  3  58.0  59.9  65.6
##  4  58.0  60.0  61.9
##  5  58.0  71.6  74.6
##  6  49.5  50.9  53.4
##  7  49.5  58.6  55.0
##  8  49.5  42.2  38.0
##  9  41.3  41.7  38.2
## 10  58.0  51.6  50.0
## # ... with 3,844 more rows
# Compare performance
yardstick::metrics(modeling_results, truth = age, estimate = CART)
## # A tibble: 1 x 2
##    rmse   rsq
##   <dbl> <dbl>
## 1  14.6 0.163
yardstick::metrics(modeling_results, truth = age, estimate = XGB)
## # A tibble: 1 x 2
##    rmse   rsq
##   <dbl> <dbl>
## 1  13.5 0.287
yardstick::metrics(modeling_results, truth = age, estimate = GBM)
## # A tibble: 1 x 2
##    rmse   rsq
##   <dbl> <dbl>
## 1  13.6 0.286
# Calculate RMSE
testing %>%
    mutate(prediction = predict(sisters_gbm, testing)) %>%
    yardstick::rmse(truth = age, estimate = prediction)
## [1] 13.87981

Business Process Analytics in R

Chapter 1 - Introduction to Process Analysis

Introduction and overview:

  • Efficient processes are core to many businesses, and improved data makes further analysis possible
  • The “internet of things” has created significant amounts of event data - why, what, and who
    • Why: the purpose of the process
    • What: the steps in the process
    • Who: the person responsible for the activity (can also be a machine or information system; referred to as “resources”)
  • Process workflow is iterative across Extraction-Processing-Analysis

Activities as cornerstones of processes:

  • Data from an online learning platform; activities are captured and can be used for further analysis
  • Activities describe the flow of the process, and are one of the most important components of the process
    • bupaR::activity_labels() is like names() for activities data
    • bupaR::activities() is like summary() for activities data
  • Each case is described by the sequence of activities, known as its “trace”
    • bupaR::traces() will create a frequency table of the traces
    • processmapR::trace_explorer() will visualize the traces of the cases

Components of process data:

  • Cases are the objects flowing through the process, while activities are the actions performed on them
    • An activity instance is the occurrence of an activity (which can be a series of events) - specific action, case, time, etc.
    • The “lifecycle status” records the stage of an activity instance, such as Scheduled, Started, Completed, and the like
    • The “event log” is the journal of the events
    • The “resources” are the actors in the process
  • Can create an event log using the eventlog() function
    • event_data %>% eventlog(case_id = "patient", activity_id = "handling", activity_instance_id = "handling_id", timestamp = "time", lifecycle_id = "registration_type", resource_id = "employee")

Example code includes:

# Load the processmapR package using library
library(processmapR)
## 
## Attaching package: 'processmapR'
## The following object is masked from 'package:stats':
## 
##     frequency
library(bupaR)
## Loading required package: edeaR
## Loading required package: eventdataR
## Loading required package: xesreadR
## 
## Attaching package: 'bupaR'
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:utils':
## 
##     timestamp
handling <- c('Registration', 'Registration', 'Registration', 'Registration', 'Registration', 'Registration', 'Registration', 'Registration', 'Registration', 'Registration', 'Registration', 'Registration', 'Triage and Assessment', 'Triage and Assessment', 'Triage and Assessment', 'Triage and Assessment', 'Triage and Assessment', 'Triage and Assessment', 'Triage and Assessment', 'Triage and Assessment', 'Triage and Assessment', 'Triage and Assessment', 'Triage and Assessment', 'Triage and Assessment', 'Blood test', 'Blood test', 'Blood test', 'Blood test', 'Blood test', 'Blood test', 'Blood test', 'Blood test', 'MRI SCAN', 'MRI SCAN', 'MRI SCAN', 'MRI SCAN', 'MRI SCAN', 'MRI SCAN', 'MRI SCAN', 'MRI SCAN', 'X-Ray', 'X-Ray', 'X-Ray', 'X-Ray', 'Discuss Results', 'Discuss Results', 'Discuss Results', 'Discuss Results', 'Discuss Results', 'Discuss Results', 'Discuss Results', 'Discuss Results', 'Discuss Results', 'Discuss Results', 'Discuss Results', 'Discuss Results', 'Check-out', 'Check-out', 'Check-out', 'Check-out', 'Check-out', 'Check-out', 'Check-out', 'Check-out', 'Check-out', 'Check-out', 'Check-out', 'Check-out', 'Registration', 'Registration', 'Registration', 'Registration', 'Registration', 'Registration', 'Registration', 'Registration', 'Registration', 'Registration', 'Registration', 'Registration', 'Triage and Assessment', 'Triage and Assessment', 'Triage and Assessment', 'Triage and Assessment', 'Triage and Assessment', 'Triage and Assessment', 'Triage and Assessment', 'Triage and Assessment', 'Triage and Assessment', 'Triage and Assessment', 'Triage and Assessment', 'Triage and Assessment', 'Blood test', 'Blood test', 'Blood test', 'Blood test', 'Blood test', 'Blood test', 'Blood test', 'Blood test', 'MRI SCAN', 'MRI SCAN', 'MRI SCAN', 'MRI SCAN', 'MRI SCAN', 'MRI SCAN', 'MRI SCAN', 'MRI SCAN', 'X-Ray', 'X-Ray', 'X-Ray', 'X-Ray', 'Discuss Results', 'Discuss Results', 'Discuss Results', 'Discuss Results', 'Discuss Results', 'Discuss Results', 'Discuss 
Results', 'Discuss Results', 'Discuss Results', 'Discuss Results', 'Discuss Results', 'Discuss Results', 'Check-out', 'Check-out', 'Check-out', 'Check-out', 'Check-out', 'Check-out', 'Check-out', 'Check-out', 'Check-out', 'Check-out', 'Check-out', 'Check-out')
patient <- c('43', '155', '156', '170', '172', '184', '221', '278', '348', '420', '455', '493', '43', '155', '156', '170', '172', '184', '221', '278', '348', '420', '455', '493', '43', '156', '170', '172', '184', '278', '348', '420', '43', '156', '170', '172', '184', '278', '348', '420', '155', '221', '455', '493', '43', '155', '156', '170', '172', '184', '221', '278', '348', '420', '455', '493', '43', '155', '156', '170', '172', '184', '221', '278', '348', '420', '455', '493', '43', '155', '156', '170', '172', '184', '221', '278', '348', '420', '455', '493', '43', '155', '156', '170', '172', '184', '221', '278', '348', '420', '455', '493', '43', '156', '170', '172', '184', '278', '348', '420', '43', '156', '170', '172', '184', '278', '348', '420', '155', '221', '455', '493', '43', '155', '156', '170', '172', '184', '221', '278', '348', '420', '455', '493', '43', '155', '156', '170', '172', '184', '221', '278', '348', '420', '455', '493')
employee <- c('r1', 'r1', 'r1', 'r1', 'r1', 'r1', 'r1', 'r1', 'r1', 'r1', 'r1', 'r1', 'r2', 'r2', 'r2', 'r2', 'r2', 'r2', 'r2', 'r2', 'r2', 'r2', 'r2', 'r2', 'r3', 'r3', 'r3', 'r3', 'r3', 'r3', 'r3', 'r3', 'r4', 'r4', 'r4', 'r4', 'r4', 'r4', 'r4', 'r4', 'r5', 'r5', 'r5', 'r5', 'r6', 'r6', 'r6', 'r6', 'r6', 'r6', 'r6', 'r6', 'r6', 'r6', 'r6', 'r6', 'r7', 'r7', 'r7', 'r7', 'r7', 'r7', 'r7', 'r7', 'r7', 'r7', 'r7', 'r7', 'r1', 'r1', 'r1', 'r1', 'r1', 'r1', 'r1', 'r1', 'r1', 'r1', 'r1', 'r1', 'r2', 'r2', 'r2', 'r2', 'r2', 'r2', 'r2', 'r2', 'r2', 'r2', 'r2', 'r2', 'r3', 'r3', 'r3', 'r3', 'r3', 'r3', 'r3', 'r3', 'r4', 'r4', 'r4', 'r4', 'r4', 'r4', 'r4', 'r4', 'r5', 'r5', 'r5', 'r5', 'r6', 'r6', 'r6', 'r6', 'r6', 'r6', 'r6', 'r6', 'r6', 'r6', 'r6', 'r6', 'r7', 'r7', 'r7', 'r7', 'r7', 'r7', 'r7', 'r7', 'r7', 'r7', 'r7', 'r7')
handling_id <- c('43', '155', '156', '170', '172', '184', '221', '278', '348', '420', '455', '493', '543', '655', '656', '670', '672', '684', '721', '778', '848', '920', '955', '993', '1020', '1072', '1081', '1082', '1088', '1127', '1163', '1199', '1257', '1309', '1318', '1319', '1325', '1364', '1400', '1436', '1557', '1587', '1710', '1730', '1777', '1889', '1890', '1904', '1906', '1918', '1955', '2012', '2082', '2154', '2189', '2227', '2272', '2384', '2385', '2399', '2401', '2413', '2450', '2507', '2577', '2649', '2684', '2720', '43', '155', '156', '170', '172', '184', '221', '278', '348', '420', '455', '493', '543', '655', '656', '670', '672', '684', '721', '778', '848', '920', '955', '993', '1020', '1072', '1081', '1082', '1088', '1127', '1163', '1199', '1257', '1309', '1318', '1319', '1325', '1364', '1400', '1436', '1557', '1587', '1710', '1730', '1777', '1889', '1890', '1904', '1906', '1918', '1955', '2012', '2082', '2154', '2189', '2227', '2272', '2384', '2385', '2399', '2401', '2413', '2450', '2507', '2577', '2649', '2684', '2720')
registration_type <- c('start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete')
rTime <- c('2017-02-19 04:38:51', '2017-06-03 10:05:28', '2017-06-03 10:05:28', '2017-06-17 15:10:30', '2017-06-17 23:00:33', '2017-06-27 07:48:22', '2017-08-03 17:05:27', '2017-09-26 20:22:49', '2017-11-24 08:28:44', '2018-02-08 03:39:21', '2018-03-14 21:04:28', '2018-04-29 04:55:10', '2017-02-19 07:28:53', '2017-06-04 06:27:00', '2017-06-03 13:23:14', '2017-06-17 16:31:58', '2017-06-18 18:29:13', '2017-06-28 00:14:50', '2017-08-04 07:22:06', '2017-09-27 22:57:03', '2017-11-24 10:33:00', '2018-02-08 17:33:12', '2018-03-15 15:12:41', '2018-04-30 19:40:22', '2017-02-20 19:59:18', '2017-06-04 15:18:50', '2017-06-18 22:51:07', '2017-06-21 02:43:27', '2017-07-01 23:55:10', '2017-09-28 22:58:23', '2017-11-25 12:06:18', '2018-02-12 09:01:38', '2017-02-21 06:49:49', '2017-06-04 23:23:28', '2017-06-19 06:44:30', '2017-06-21 11:16:30', '2017-07-02 11:16:08', '2017-09-29 07:28:10', '2017-11-25 21:54:56', '2018-02-12 19:43:42', '2017-06-05 00:12:24', '2017-08-05 08:25:17', '2018-03-17 10:30:24', '2018-05-02 07:32:45', '2017-02-21 14:50:43', '2017-06-05 14:03:19', '2017-06-05 10:26:16', '2017-06-19 22:46:10', '2017-06-22 04:39:35', '2017-07-03 01:28:49', '2017-08-05 22:06:23', '2017-09-29 19:13:51', '2017-11-26 06:52:23', '2018-02-17 02:44:58', '2018-03-18 00:20:51', '2018-05-02 18:14:11', '2017-02-24 14:58:43', '2017-06-05 15:58:53', '2017-06-05 15:58:53', '2017-06-20 03:48:37', '2017-06-22 08:40:55', '2017-07-03 03:39:51', '2017-08-08 23:17:45', '2017-09-29 21:16:01', '2017-11-27 04:56:53', '2018-02-20 09:49:29', '2018-03-18 08:12:07', '2018-05-03 00:11:10', '2017-02-19 07:28:53', '2017-06-03 14:19:00', '2017-06-03 13:23:14', '2017-06-17 16:31:58', '2017-06-18 01:07:42', '2017-06-27 12:22:51', '2017-08-03 19:25:12', '2017-09-26 22:17:18', '2017-11-24 10:33:00', '2018-02-08 06:01:38', '2018-03-15 00:34:01', '2018-04-29 07:39:14', '2017-02-19 21:58:08', '2017-06-04 14:23:26', '2017-06-04 06:27:00', '2017-06-18 04:14:55', '2017-06-19 00:40:19', '2017-06-28 12:48:20', 
'2017-08-04 21:09:17', '2017-09-28 12:00:12', '2017-11-25 00:44:30', '2018-02-09 07:05:52', '2018-03-16 04:08:03', '2018-05-01 10:37:51', '2017-02-21 03:12:26', '2017-06-04 19:35:51', '2017-06-19 03:01:11', '2017-06-21 08:02:20', '2017-07-02 07:43:48', '2017-09-29 04:58:49', '2017-11-25 18:30:43', '2018-02-12 13:57:13', '2017-02-21 09:57:05', '2017-06-05 02:46:59', '2017-06-19 11:40:53', '2017-06-21 16:09:26', '2017-07-02 16:03:16', '2017-09-29 12:44:39', '2017-11-26 02:40:30', '2018-02-12 23:53:46', '2017-06-05 04:39:38', '2017-08-05 13:56:39', '2018-03-17 14:09:40', '2018-05-02 12:24:41', '2017-02-21 17:57:58', '2017-06-05 15:58:53', '2017-06-05 14:03:19', '2017-06-20 01:44:29', '2017-06-22 08:40:55', '2017-07-03 03:39:51', '2017-08-05 23:53:27', '2017-09-29 21:16:01', '2017-11-26 09:44:37', '2018-02-17 06:17:57', '2018-03-18 03:22:17', '2018-05-02 21:17:12', '2017-02-24 16:03:49', '2017-06-05 17:22:16', '2017-06-05 17:15:30', '2017-06-20 05:36:40', '2017-06-22 10:59:58', '2017-07-03 05:00:48', '2017-08-09 00:13:39', '2017-09-29 23:42:48', '2017-11-27 06:53:23', '2018-02-20 12:04:00', '2018-03-18 10:48:34', '2018-05-03 02:11:42')
rOrder <- c(43, 155, 156, 170, 172, 184, 221, 278, 348, 420, 455, 493, 543, 655, 656, 670, 672, 684, 721, 778, 848, 920, 955, 993, 1020, 1072, 1081, 1082, 1088, 1127, 1163, 1199, 1257, 1309, 1318, 1319, 1325, 1364, 1400, 1436, 1557, 1587, 1710, 1730, 1777, 1889, 1890, 1904, 1906, 1918, 1955, 2012, 2082, 2154, 2189, 2227, 2272, 2384, 2385, 2399, 2401, 2413, 2450, 2507, 2577, 2649, 2684, 2720, 2764, 2876, 2877, 2891, 2893, 2905, 2942, 2999, 3069, 3141, 3176, 3214, 3264, 3376, 3377, 3391, 3393, 3405, 3442, 3499, 3569, 3641, 3676, 3714, 3741, 3793, 3802, 3803, 3809, 3848, 3884, 3920, 3978, 4030, 4039, 4040, 4046, 4085, 4121, 4157, 4278, 4308, 4431, 4451, 4498, 4610, 4611, 4625, 4627, 4639, 4676, 4733, 4803, 4875, 4910, 4948, 4993, 5105, 5106, 5120, 5122, 5134, 5171, 5228, 5298, 5370, 5405, 5441)

pFrame <- tibble(handling=factor(handling, levels=c('Blood test', 'Check-out', 'Discuss Results', 'MRI SCAN', 'Registration', 'Triage and Assessment', 'X-Ray')), 
                 patient=patient, 
                 employee=factor(employee, levels=c('r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7')), 
                 handling_id=handling_id, 
                 registration_type=factor(registration_type, levels=c("complete", "start")), 
                 time=as.POSIXct(rTime), 
                 .order=rOrder
                 )

patients <- eventlog(pFrame,
    case_id = "patient",
    activity_id = "handling",
    activity_instance_id = "handling_id",
    lifecycle_id = "registration_type",
    timestamp = "time",
    resource_id = "employee")


# The function slice can be used to take a slice of cases out of the eventdata. slice(1:10) will select the first ten cases in the event log, where first is defined by the current ordering of the data.

# How many patients are there?
n_cases(patients)
## [1] 12
# Print the summary of the data
summary(patients)
## Number of events:  136
## Number of cases:  12
## Number of traces:  2
## Number of distinct activities:  7
## Average trace length:  11.33333
## 
## Start eventlog:  2017-02-19 04:38:51
## End eventlog:  2018-05-03 02:11:42
##                   handling    patient          employee handling_id       
##  Blood test           :16   Length:136         r1:24    Length:136        
##  Check-out            :24   Class :character   r2:24    Class :character  
##  Discuss Results      :24   Mode  :character   r3:16    Mode  :character  
##  MRI SCAN             :16                      r4:16                      
##  Registration         :24                      r5: 8                      
##  Triage and Assessment:24                      r6:24                      
##  X-Ray                : 8                      r7:24                      
##  registration_type      time                         .order      
##  complete:68       Min.   :2017-02-19 04:38:51   Min.   :  1.00  
##  start   :68       1st Qu.:2017-06-14 15:43:26   1st Qu.: 34.75  
##                    Median :2017-07-03 03:39:51   Median : 68.50  
##                    Mean   :2017-09-06 11:31:32   Mean   : 68.50  
##                    3rd Qu.:2017-11-26 14:32:41   3rd Qu.:102.25  
##                    Max.   :2018-05-03 02:11:42   Max.   :136.00  
## 
# Show the journey of the first patient
slice(patients, 1)
## Event log consisting of:
## 12 events
## 1 traces
## 1 cases
## 6 activities
## 6 activity instances
## 
## # A tibble: 12 x 7
##    handling              patient employee handling_id registration_type
##    <fct>                 <chr>   <fct>    <chr>       <fct>            
##  1 Registration          43      r1       43          start            
##  2 Triage and Assessment 43      r2       543         start            
##  3 Blood test            43      r3       1020        start            
##  4 MRI SCAN              43      r4       1257        start            
##  5 Discuss Results       43      r6       1777        start            
##  6 Check-out             43      r7       2272        start            
##  7 Registration          43      r1       43          complete         
##  8 Triage and Assessment 43      r2       543         complete         
##  9 Blood test            43      r3       1020        complete         
## 10 MRI SCAN              43      r4       1257        complete         
## 11 Discuss Results       43      r6       1777        complete         
## 12 Check-out             43      r7       2272        complete         
## # ... with 2 more variables: time <dttm>, .order <int>
# How many distinct activities are there?
n_activities(patients)
## [1] 7
# What are the names of the activities?
activity_labels(patients)
## [1] Registration          Triage and Assessment Blood test           
## [4] MRI SCAN              X-Ray                 Discuss Results      
## [7] Check-out            
## 7 Levels: Blood test Check-out Discuss Results MRI SCAN ... X-Ray
# Create a list of activities
activities(patients)
## # A tibble: 7 x 3
##   handling              absolute_frequency relative_frequency
##   <fct>                              <int>              <dbl>
## 1 Check-out                             12             0.176 
## 2 Discuss Results                       12             0.176 
## 3 Registration                          12             0.176 
## 4 Triage and Assessment                 12             0.176 
## 5 Blood test                             8             0.118 
## 6 MRI SCAN                               8             0.118 
## 7 X-Ray                                  4             0.0588
# Have a look at the different traces
traces(patients)
## # A tibble: 2 x 3
##   trace                                 absolute_frequen~ relative_freque~
##   <chr>                                             <int>            <dbl>
## 1 Registration,Triage and Assessment,B~                 8            0.667
## 2 Registration,Triage and Assessment,X~                 4            0.333
# How many are there?
n_traces(patients)
## [1] 2
# Visualize the traces using trace_explorer
trace_explorer(patients, coverage=1)

# Draw process map
process_map(patients)
claims <- tibble(id=c("claim1", "claim1", "claim2", "claim2", "claim2"), 
                 action=c(10002L, 10011L, 10015L, 10024L, 10024L), 
                 action_type=c("Check Contract", "Pay Back Decision", "Check Contract", "Pay Back Decision", "Pay Back Decision"), 
                 date=as.Date(c("2008-01-12", "2008-03-22", "2008-01-13", "2008-03-23", "2008-04-14")), 
                 originator=c("Assistant 1", "Manager 2", "Assistant 6", "Manager 2", "Manager 2"), 
                 status=as.factor(c("start", "start", "start", "start", "complete"))
                 )
claims
## # A tibble: 5 x 6
##   id     action action_type       date       originator  status  
##   <chr>   <int> <chr>             <date>     <chr>       <fct>   
## 1 claim1  10002 Check Contract    2008-01-12 Assistant 1 start   
## 2 claim1  10011 Pay Back Decision 2008-03-22 Manager 2   start   
## 3 claim2  10015 Check Contract    2008-01-13 Assistant 6 start   
## 4 claim2  10024 Pay Back Decision 2008-03-23 Manager 2   start   
## 5 claim2  10024 Pay Back Decision 2008-04-14 Manager 2   complete
#create eventlog claims_log 
claims_log <- eventlog(claims,
    case_id = "id",
    activity_id = "action_type",
    activity_instance_id = "action",
    lifecycle_id = "status",
    timestamp = "date",
    resource_id = "originator")

# Print summary
summary(claims_log)
## Number of events:  5
## Number of cases:  2
## Number of traces:  1
## Number of distinct activities:  2
## Average trace length:  2.5
## 
## Start eventlog:  2008-01-12
## End eventlog:  2008-04-14
##       id               action                     action_type
##  Length:5           Length:5           Check Contract   :2   
##  Class :character   Class :character   Pay Back Decision:3   
##  Mode  :character   Mode  :character                         
##                                                              
##                                                              
##                                                              
##       date                  originator      status      .order 
##  Min.   :2008-01-12   Assistant 1:1    complete:1   Min.   :1  
##  1st Qu.:2008-01-13   Assistant 6:1    start   :4   1st Qu.:2  
##  Median :2008-03-22   Manager 2  :3                 Median :3  
##  Mean   :2008-02-28                                 Mean   :3  
##  3rd Qu.:2008-03-23                                 3rd Qu.:4  
##  Max.   :2008-04-14                                 Max.   :5
# Check activity labels
activity_labels(claims_log)
## [1] Check Contract    Pay Back Decision
## Levels: Check Contract Pay Back Decision
# Once you have an eventlog, you can access its complete metadata using the function mapping or the functions case_id, activity_id etc., to inspect individual identifiers.

Chapter 2 - Analysis Techniques

Organizational analysis:

  • Processes are always dependent on resources, even if automated (machines and algorithms can be resources)
    • Who executes the task, how specialized is the knowledge, etc.
    • resource_labels(log_hospital) # will pull out the resources
    • resources(log_hospital) # will pull out frequencies by resource
  • Can create a resource-activity matrix
    • A person who performs only a few activities is considered to be specialized in those activities
    • If only one person ever performs a specific activity, then there is a high risk of “brain drain”
    • The plot() function, applied to an event_log, will create the resource-activity matrix
    • resource_map(log_hospital) # shows arrows between the work flows
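
The organizational-analysis calls above can be sketched as follows. This is a minimal sketch, assuming the bupaR family of packages (bupaR, edeaR, processmapR) is attached and using the patients event log built earlier in this document in place of log_hospital:

```r
library(bupaR)
library(processmapR)

resource_labels(patients)    # names of the resources (r1 through r7)
resources(patients)          # frequency table per resource

# Resource-activity matrix: who performs which activity, and how often
patients %>%
    resource_frequency(level = "resource-activity") %>%
    plot()

# Hand-over-of-work map: arrows show how cases flow between resources
resource_map(patients)
```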

Structuredness:

  • Control-flow refers to the succession of activities
    • Each unique flow is referred to as a trace
    • Metrics include entry/exit points, length of cases, presence of activities, rework, etc.
    • log_healthcare %>% start_activities(level = "activity") %>% plot()
    • log_healthcare %>% end_activities(level = "activity") %>% plot()
  • Rework is when the same activity is done multiple times for the same case
    • Repetitions are when the activity is repeated after some intervening steps
    • Self-loops are when the activity is repeated immediately after itself
  • The precedence matrix shows the relationships between the activities in a more structured manner
    • eventlog %>% precedence_matrix(type = "absolute") %>% plot() # can be type = "relative" also
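
The structuredness metrics above can be sketched as follows, assuming the bupaR family of packages is attached and using the sepsis event log from eventdataR (which is also used later in this chapter):

```r
library(bupaR)
library(processmapR)
data(sepsis, package = "eventdataR")

# Entry and exit points of the process
sepsis %>% start_activities(level = "activity") %>% plot()
sepsis %>% end_activities(level = "activity") %>% plot()

# Self-loops (immediate repeats) per activity
sepsis %>% number_of_selfloops(level = "activity")

# Precedence matrix of direct activity successions
sepsis %>% precedence_matrix(type = "absolute") %>% plot()
```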

Performance analysis:

  • Visuals can include performance process maps and dotted charts; metrics can include throughput time, processing time, idle time
    • eventlog %>% process_map(type = frequency()) # normal process map
    • eventlog %>% process_map(type = performance()) # performance process map
  • The dotted chart shows the frequency of activities over time; basically, a form of scatter plot
    • throughput_time is the total elapsed time, processing_time is the sum of activity durations, and idle_time is the sum of time when nothing is happening
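
The performance views above can be sketched as follows. A minimal sketch, again assuming the bupaR family of packages and using the patients event log built earlier:

```r
library(bupaR)
library(processmapR)

patients %>% process_map(type = frequency())                  # counts on nodes and edges
patients %>% process_map(type = performance(mean, "days"))    # mean durations instead

# Dotted chart: one dot per event, cases on the y-axis, time on the x-axis
patients %>% dotted_chart(x = "absolute")

# Time metrics at the case level
patients %>% throughput_time(level = "case")
patients %>% processing_time(level = "case")
patients %>% idle_time(level = "case")
```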

Linking perspectives:

  • Granularity can help give the statistics at the desired levels
    • (level = "log", "trace", "case", "activity", "resource", "resource-activity")
  • Categorical data can be leveraged using the group_by() functionality - each group will then be calculated separately
    • eventlog %>% group_by(priority) %>% number_of_repetitions(level = "resource") %>% plot()
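
A minimal sketch of grouped metrics: the sepsis log has no priority column, so this groups by resource instead to illustrate that each group's metric is calculated separately:

```r
library(bupaR)
data(sepsis, package = "eventdataR")

# Log-level repetition metrics, computed once per resource group
sepsis %>%
    group_by(resource) %>%
    number_of_repetitions(level = "log")
```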

Example code includes:

data(sepsis, package="eventdataR")
str(sepsis)
## Classes 'eventlog', 'tbl_df', 'tbl' and 'data.frame':    15214 obs. of  34 variables:
##  $ case_id                  : chr  "A" "A" "A" "A" ...
##  $ activity                 : Factor w/ 16 levels "Admission IC",..: 4 10 3 9 6 5 8 7 2 3 ...
##  $ lifecycle                : Factor w/ 1 level "complete": 1 1 1 1 1 1 1 1 1 1 ...
##  $ resource                 : Factor w/ 26 levels "?","A","B","C",..: 2 3 3 3 4 2 2 2 5 3 ...
##  $ timestamp                : POSIXct, format: "2014-10-22 11:15:41" "2014-10-22 11:27:00" ...
##  $ age                      : int  85 NA NA NA NA NA NA NA NA NA ...
##  $ crp                      : num  NA NA 210 NA NA NA NA NA NA 1090 ...
##  $ diagnose                 : chr  "A" NA NA NA ...
##  $ diagnosticartastrup      : chr  "true" NA NA NA ...
##  $ diagnosticblood          : chr  "true" NA NA NA ...
##  $ diagnosticecg            : chr  "true" NA NA NA ...
##  $ diagnosticic             : chr  "true" NA NA NA ...
##  $ diagnosticlacticacid     : chr  "true" NA NA NA ...
##  $ diagnosticliquor         : chr  "false" NA NA NA ...
##  $ diagnosticother          : chr  "false" NA NA NA ...
##  $ diagnosticsputum         : chr  "false" NA NA NA ...
##  $ diagnosticurinaryculture : chr  "true" NA NA NA ...
##  $ diagnosticurinarysediment: chr  "true" NA NA NA ...
##  $ diagnosticxthorax        : chr  "true" NA NA NA ...
##  $ disfuncorg               : chr  "true" NA NA NA ...
##  $ hypotensie               : chr  "true" NA NA NA ...
##  $ hypoxie                  : chr  "false" NA NA NA ...
##  $ infectionsuspected       : chr  "true" NA NA NA ...
##  $ infusion                 : chr  "true" NA NA NA ...
##  $ lacticacid               : chr  NA NA NA "2.2" ...
##  $ leucocytes               : chr  NA "9.6" NA NA ...
##  $ oligurie                 : chr  "false" NA NA NA ...
##  $ sirscritheartrate        : chr  "true" NA NA NA ...
##  $ sirscritleucos           : chr  "false" NA NA NA ...
##  $ sirscrittachypnea        : chr  "true" NA NA NA ...
##  $ sirscrittemperature      : chr  "true" NA NA NA ...
##  $ sirscriteria2ormore      : chr  "true" NA NA NA ...
##  $ activity_instance_id     : chr  "1" "2" "3" "4" ...
##  $ .order                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  - attr(*, "case_id")= chr "case_id"
##  - attr(*, "activity_id")= chr "activity"
##  - attr(*, "activity_instance_id")= chr "activity_instance_id"
##  - attr(*, "lifecycle_id")= chr "lifecycle"
##  - attr(*, "resource_id")= chr "resource"
##  - attr(*, "timestamp")= chr "timestamp"
# Print list of resources
resource_frequency(sepsis, level="resource")
## # A tibble: 26 x 3
##    resource absolute relative
##    <fct>       <int>    <dbl>
##  1 B            8111  0.533  
##  2 A            3462  0.228  
##  3 C            1053  0.0692 
##  4 E             782  0.0514 
##  5 ?             294  0.0193 
##  6 F             216  0.0142 
##  7 L             213  0.0140 
##  8 O             186  0.0122 
##  9 G             148  0.00973
## 10 I             126  0.00828
## # ... with 16 more rows
# Number of resources per activity
resource_frequency(sepsis, level = "activity")
## # A tibble: 16 x 11
##    activity nr_of_resources   min     q1   mean median     q3   max st_dev
##    <fct>              <int> <int>  <dbl>  <dbl>  <dbl>  <dbl> <int>  <dbl>
##  1 Admissi~               4     1 7.00e0 2.92e1 3.10e1 5.32e1    54   28.2
##  2 Admissi~              20     1 1.70e1 5.91e1 4.05e1 6.82e1   216   62.7
##  3 CRP                    1  3262 3.26e3 3.26e3 3.26e3 3.26e3  3262   NA  
##  4 ER Regi~               2    65 2.95e2 5.25e2 5.25e2 7.55e2   985  651  
##  5 ER Seps~               2    65 2.95e2 5.24e2 5.24e2 7.54e2   984  650  
##  6 ER Tria~               1  1053 1.05e3 1.05e3 1.05e3 1.05e3  1053   NA  
##  7 IV Anti~               2    45 2.28e2 4.12e2 4.12e2 5.95e2   778  518  
##  8 IV Liqu~               2    38 2.07e2 3.76e2 3.76e2 5.46e2   715  479  
##  9 LacticA~               1  1466 1.47e3 1.47e3 1.47e3 1.47e3  1466   NA  
## 10 Leucocy~               1  3383 3.38e3 3.38e3 3.38e3 3.38e3  3383   NA  
## 11 Release~               1   671 6.71e2 6.71e2 6.71e2 6.71e2   671   NA  
## 12 Release~               1    56 5.60e1 5.60e1 5.60e1 5.60e1    56   NA  
## 13 Release~               1    25 2.50e1 2.50e1 2.50e1 2.50e1    25   NA  
## 14 Release~               1    24 2.40e1 2.40e1 2.40e1 2.40e1    24   NA  
## 15 Release~               1     6 6.00e0 6.00e0 6.00e0 6.00e0     6   NA  
## 16 Return ~               1   294 2.94e2 2.94e2 2.94e2 2.94e2   294   NA  
## # ... with 2 more variables: iqr <dbl>, total <int>
# Plot Number of executions per resource-activity
resource_frequency(sepsis, level = "resource-activity") %>% plot()

# Calculate resource involvement
resource_involvement(sepsis, level="resource")
## # A tibble: 26 x 3
##    resource absolute relative
##    <fct>       <int>    <dbl>
##  1 C            1050   1.00  
##  2 B            1013   0.965 
##  3 A             985   0.938 
##  4 E             782   0.745 
##  5 ?             294   0.280 
##  6 F             200   0.190 
##  7 O             179   0.170 
##  8 G             147   0.140 
##  9 I             118   0.112 
## 10 M              82   0.0781
## # ... with 16 more rows
# Show graphically 
sepsis %>% resource_involvement(level = "resource") %>% plot

# Compare with resource frequency
resource_frequency(sepsis, level="resource")
## # A tibble: 26 x 3
##    resource absolute relative
##    <fct>       <int>    <dbl>
##  1 B            8111  0.533  
##  2 A            3462  0.228  
##  3 C            1053  0.0692 
##  4 E             782  0.0514 
##  5 ?             294  0.0193 
##  6 F             216  0.0142 
##  7 L             213  0.0140 
##  8 O             186  0.0122 
##  9 G             148  0.00973
## 10 I             126  0.00828
## # ... with 16 more rows
# Min, max and average number of repetitions
sepsis %>% number_of_repetitions(level = "log")
## Using default type: all
##      min       q1   median     mean       q3      max   st_dev      iqr 
## 0.000000 0.000000 2.000000 1.640000 3.000000 5.000000 1.280461 3.000000 
## attr(,"type")
## [1] "all"
# Plot repetitions per activity
sepsis %>% number_of_repetitions(level = "activity") %>% plot
## Using default type: all

# Number of repetitions per resources
sepsis %>% number_of_repetitions(level = "resource")
## Using default type: all
## # resource_metric [26 x 3]
##    first_resource absolute relative
##    <fct>             <dbl>    <dbl>
##  1 ?                  0     0      
##  2 A                  0     0      
##  3 B               1536     0.189  
##  4 C                  3.00  0.00285
##  5 D                  0     0      
##  6 E                  0     0      
##  7 F                 16.0   0.0741 
##  8 G                 67.0   0.453  
##  9 H                  6.00  0.109  
## 10 I                 12.0   0.0952 
## # ... with 16 more rows
eci <- c('21', '21', '21', '21', '21', '21', '21', '21', '21', '31', '31', '31', '31', '31', '31', '31', '31', '31', '31', '41', '41', '41', '41', '41', '41', '41', '51', '51', '51', '51', '51', '51', '51', '61', '61', '61', '61', '61', '61', '91', '91', '91', '91', '91', '91', '101', '101', '101', '101', '101', '101', '111', '111', '111', '111', '121', '121', '121', '121', '121', '121', '121', '121', '121', '131', '131', '131', '131', '131', '131', '131', '131', '161', '161', '171', '171', '171', '171', '181', '181', '181', '181', '181', '181', '201', '201', '201', '201', '201', '201', '201', '12', '12', '12', '12', '12', '22', '22', '22', '22', '22', '22', '32', '32', '32', '32', '32', '32', '42', '42', '42', '42', '52', '52', '52', '52', '52', '82', '82', '82', '82', '82', '92', '92', '92', '92', '92', '102', '102', '102', '102', '102', '112', '112', '122', '122', '21', '21', '21', '21', '21', '21', '21', '21', '21', '31', '31', '31', '31', '31', '31', '31', '31', '31', '31', '41', '41', '41', '41', '41', '41', '41', '51', '51', '51', '51', '51', '51', '51', '61', '61', '61', '61', '61', '61', '91', '91', '91', '91', '91', '91', '101', '101', '101', '101', '101', '101', '111', '111', '111', '111', '121', '121', '121', '121', '121', '121', '121', '121', '121', '131', '131', '131', '131', '131', '131', '131', '131', '161', '161', '171', '171', '171', '171', '181', '181', '181', '181', '181', '181', '201', '201', '201', '201', '201', '201', '201', '12', '12', '12', '12', '12', '22', '22', '22', '22', '22', '22', '32', '32', '32', '32', '32', '32', '42', '42', '42', '42', '52', '52', '52', '52', '52', '82', '82', '82', '82', '82', '92', '92', '92', '92', '92', '102', '102', '102', '102', '102', '112', '112', '122', '122')
ea1 <- c('prepareBreakfast', 'eatingBreakfast', 'snack', 'prepareLunch', 'eatingLunch', 'snack', 'snack', 'prepareDinner', 'eatingDinner', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'snack', 'prepareLunch', 'eatingLunch', 'snack', 'snack', 'prepareDinner', 'eatingDinner', 'prepareBreakfast', 'eatingBreakfast', 'prepareBreakfast', 'eatingBreakfast', 'prepareDinner', 'eatingDinner', 'snack', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'prepareLunch', 'eatingLunch', 'prepareDinner', 'eatingDinner', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'prepareLunch', 'eatingLunch', 'snack', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'prepareDinner', 'eatingDinner', 'snack', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'snack', 'prepareLunch', 'eatingLunch', 'prepareBreakfast', 'prepareLunch', 'eatingLunch', 'snack', 'snack', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'snack', 'prepareLunch', 'eatingLunch', 'snack', 'snack', 'snack', 'eatingBreakfast', 'prepareBreakfast', 'snack', 'prepareLunch', 'eatingLunch', 'snack', 'snack', 'eatingBreakfast', 'prepareBreakfast', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'snack', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'snack', 'snack', 'eatingDinner', 'prepareBreakfast', 'eatingBreakfast', 'prepareLunch', 'eatingLunch', 'snack', 'prepareDinner', 'eatingDinner', 'prepareBreakfast', 'eatingBreakfast', 'prepareLunch', 'eatingLunch', 'snack', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'prepareLunch', 'eatingLunch', 'snack', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'prepareLunch', 'eatingLunch', 'snack', 'eatingLunch', 'prepareBreakfast', 'eatingBreakfast', 'prepareLunch', 'prepareBreakfast', 'eatingBreakfast', 'prepareLunch', 'eatingLunch', 'snack', 'prepareBreakfast', 'eatingBreakfast', 'prepareLunch', 'eatingLunch', 'snack', 'prepareBreakfast', 'eatingBreakfast', 'prepareLunch', 'eatingLunch', 'snack', 'prepareBreakfast', 'eatingBreakfast', 'prepareLunch', 'eatingLunch', 'snack', 
'prepareBreakfast', 'eatingBreakfast', 'prepareBreakfast')
ea2 <- c('eatingBreakfast', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'prepareLunch', 'eatingLunch', 'snack', 'snack', 'prepareDinner', 'eatingDinner', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'snack', 'prepareLunch', 'eatingLunch', 'snack', 'snack', 'prepareDinner', 'eatingDinner', 'prepareBreakfast', 'eatingBreakfast', 'prepareBreakfast', 'eatingBreakfast', 'prepareDinner', 'eatingDinner', 'snack', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'prepareLunch', 'eatingLunch', 'prepareDinner', 'eatingDinner', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'prepareLunch', 'eatingLunch', 'snack', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'prepareDinner', 'eatingDinner', 'snack', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'snack', 'prepareLunch', 'eatingLunch', 'prepareBreakfast', 'prepareLunch', 'eatingLunch', 'snack', 'snack', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'snack', 'prepareLunch', 'eatingLunch', 'snack', 'snack', 'snack', 'eatingBreakfast', 'prepareBreakfast', 'snack', 'prepareLunch', 'eatingLunch', 'snack', 'snack', 'eatingBreakfast', 'prepareBreakfast', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'snack', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'snack', 'snack', 'eatingDinner', 'prepareBreakfast', 'eatingBreakfast', 'prepareLunch', 'eatingLunch', 'snack', 'prepareDinner', 'eatingDinner', 'prepareBreakfast', 'eatingBreakfast', 'prepareLunch', 'eatingLunch', 'snack', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'prepareLunch', 'eatingLunch', 'snack', 'prepareBreakfast', 'eatingBreakfast', 'snack', 'prepareLunch', 'eatingLunch', 'snack', 'eatingLunch', 'prepareBreakfast', 'eatingBreakfast', 'prepareLunch', 'prepareBreakfast', 'eatingBreakfast', 'prepareLunch', 'eatingLunch', 'snack', 'prepareBreakfast', 'eatingBreakfast', 'prepareLunch', 'eatingLunch', 'snack', 'prepareBreakfast', 'eatingBreakfast', 'prepareLunch', 'eatingLunch', 'snack', 'prepareBreakfast', 'eatingBreakfast', 'prepareLunch', 'eatingLunch', 
'snack', 'prepareBreakfast', 'eatingBreakfast', 'prepareBreakfast', 'eatingBreakfast')
eaii <- c('9', '10', '19', '23', '24', '26', '36', '40', '41', '51', '52', '58', '60', '62', '63', '67', '69', '72', '73', '86', '87', '89', '90', '104', '105', '107', '119', '120', '128', '132', '133', '138', '139', '149', '150', '156', '159', '160', '164', '174', '175', '192', '194', '195', '198', '205', '206', '208', '211', '213', '214', '229', '236', '237', '239', '245', '251', '252', '253', '255', '259', '260', '262', '264', '271', '276', '281', '287', '292', '293', '297', '299', '310', '312', '331', '332', '336', '347', '363', '364', '374', '376', '387', '389', '434', '435', '447', '448', '450', '453', '454', '462', '463', '471', '472', '475', '483', '484', '487', '491', '492', '496', '508', '509', '512', '517', '518', '522', '536', '540', '541', '543', '562', '563', '565', '566', '572', '584', '585', '589', '590', '598', '615', '616', '618', '619', '627', '639', '640', '642', '643', '653', '665', '666', '682', '683', '9', '10', '19', '23', '24', '26', '36', '40', '41', '51', '52', '58', '60', '62', '63', '67', '69', '72', '73', '86', '87', '89', '90', '104', '105', '107', '119', '120', '128', '132', '133', '138', '139', '149', '150', '156', '159', '160', '164', '174', '175', '192', '194', '195', '198', '205', '206', '208', '211', '213', '214', '229', '236', '237', '239', '245', '251', '252', '253', '255', '259', '260', '262', '264', '271', '276', '281', '287', '292', '293', '297', '299', '310', '312', '331', '332', '336', '347', '363', '364', '374', '376', '387', '389', '434', '435', '447', '448', '450', '453', '454', '462', '463', '471', '472', '475', '483', '484', '487', '491', '492', '496', '508', '509', '512', '517', '518', '522', '536', '540', '541', '543', '562', '563', '565', '566', '572', '584', '585', '589', '590', '598', '615', '616', '618', '619', '627', '639', '640', '642', '643', '653', '665', '666', '682', '683')
elci <- c('start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'start', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 
'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete', 'complete')
ets1 <- c('2012-11-12 09:42:02', '2012-11-12 09:52:33', '2012-11-12 11:05:44', '2012-11-12 13:45:49', '2012-11-12 13:48:49', '2012-11-12 15:23:00', '2012-11-12 18:47:29', '2012-11-12 22:35:21', '2012-11-12 22:35:21', '2012-11-13 08:56:37', '2012-11-13 09:04:54', '2012-11-13 10:14:04', '2012-11-13 13:47:45', '2012-11-13 14:08:24', '2012-11-13 14:19:01', '2012-11-13 17:34:23', '2012-11-13 18:51:51', '2012-11-13 23:05:07', '2012-11-13 23:17:07', '2012-11-14 09:06:08', '2012-11-14 09:17:48', '2012-11-14 10:38:16', '2012-11-14 10:44:16', '2012-11-14 21:30:09', '2012-11-14 21:37:09', '2012-11-14 22:14:23', '2012-11-15 09:37:15', '2012-11-15 09:47:12', '2012-11-15 10:11:08', '2012-11-15 14:35:27', '2012-11-15 14:41:27', '2012-11-15 22:07:26', '2012-11-15 22:26:02', '2012-11-16 10:39:14', '2012-11-16 10:52:56', '2012-11-16 12:09:10', '2012-11-16 14:13:00', '2012-11-16 14:19:00', '2012-11-16 18:11:36', '2012-11-19 10:13:23', '2012-11-19 10:25:00', '2012-11-19 15:55:22', '2012-11-19 21:47:27', '2012-11-19 21:59:27', '2012-11-19 22:31:06', '2012-11-20 10:20:00', '2012-11-20 10:21:02', '2012-11-20 11:00:16', '2012-11-20 13:03:28', '2012-11-20 14:25:11', '2012-11-20 14:41:22', '2012-11-21 10:01:00', '2012-11-21 15:02:08', '2012-11-21 15:15:08', '2012-11-21 17:50:29', '2012-11-22 01:40:42', '2012-11-22 10:19:15', '2012-11-22 10:26:15', '2012-11-22 11:02:27', '2012-11-22 11:56:06', '2012-11-22 15:05:51', '2012-11-22 15:12:55', '2012-11-22 16:43:08', '2012-11-22 18:15:32', '2012-11-23 00:36:00', '2012-11-23 01:03:00', '2012-11-23 09:49:00', '2012-11-23 12:53:06', '2012-11-23 14:01:08', '2012-11-23 14:23:08', '2012-11-23 16:57:24', '2012-11-23 17:58:00', '2012-11-26 09:06:12', '2012-11-26 09:57:12', '2012-11-27 10:20:26', '2012-11-27 10:30:50')
ets2 <- c('2012-11-27 11:54:15', '2012-11-27 19:46:15', '2012-11-28 09:27:15', '2012-11-28 09:34:15', '2012-11-28 12:28:02', '2012-11-28 13:16:33', '2012-11-28 19:30:08', '2012-11-28 22:15:02', '2012-11-30 10:43:19', '2012-11-30 10:46:19', '2012-11-30 14:51:36', '2012-11-30 15:08:36', '2012-11-30 17:30:40', '2012-11-30 22:12:05', '2012-11-30 22:16:07', '2011-11-28 10:38:00', '2011-11-28 10:43:00', '2011-11-28 14:31:06', '2011-11-28 14:42:00', '2011-11-28 20:20:55', '2011-11-29 12:09:09', '2011-11-29 12:11:01', '2011-11-29 13:25:29', '2011-11-29 15:15:14', '2011-11-29 15:23:00', '2011-11-29 16:32:20', '2011-11-30 10:23:46', '2011-11-30 10:28:46', '2011-11-30 13:05:27', '2011-11-30 14:39:42', '2011-11-30 14:56:00', '2011-11-30 16:41:05', '2011-11-30 14:37:00', '2011-12-01 11:17:05', '2011-12-01 11:20:05', '2011-12-01 14:29:37', '2011-12-02 12:29:08', '2011-12-02 12:32:08', '2011-12-02 14:47:18', '2011-12-02 14:51:00', '2011-12-02 19:40:44', '2011-12-05 12:15:45', '2011-12-05 12:18:05', '2011-12-05 15:00:55', '2011-12-05 15:14:00', '2011-12-05 19:24:11', '2011-12-06 11:30:19', '2011-12-06 11:33:02', '2011-12-06 14:41:16', '2011-12-06 14:56:00', '2011-12-06 19:22:50', '2011-12-07 11:12:17', '2011-12-07 11:17:22', '2011-12-07 14:04:32', '2011-12-07 14:14:00', '2011-12-07 19:23:55', '2011-12-08 11:25:12', '2011-12-08 11:29:01', '2011-12-09 11:00:13', '2011-12-09 11:03:33', '2012-11-12 09:50:02', '2012-11-12 09:55:29', '2012-11-12 12:39:42', '2012-11-12 14:48:14', '2012-11-12 14:53:14', '2012-11-12 15:31:53', '2012-11-12 19:00:56', '2012-11-12 22:37:55', '2012-11-12 22:40:55', '2012-11-13 09:00:26', '2012-11-13 09:10:12', '2012-11-13 10:51:55', '2012-11-13 14:03:31', '2012-11-13 14:18:36', '2012-11-13 14:42:36', '2012-11-13 17:36:34', '2012-11-13 19:45:03', '2012-11-13 23:15:33', '2012-11-13 23:37:33', '2012-11-14 09:09:41', '2012-11-14 09:21:43', '2012-11-14 11:43:23', '2012-11-14 11:06:23', '2012-11-14 21:35:17', '2012-11-14 21:47:18', '2012-11-14 22:17:47', '2012-11-15 
09:44:06', '2012-11-15 09:48:08', '2012-11-15 10:23:49', '2012-11-15 15:40:32', '2012-11-15 15:46:32', '2012-11-15 22:22:44', '2012-11-15 22:31:00', '2012-11-16 10:42:13') 
ets3 <- c('2012-11-16 10:52:58', '2012-11-16 12:09:57', '2012-11-16 14:58:55', '2012-11-16 14:55:55', '2012-11-16 18:14:49', '2012-11-19 10:17:12', '2012-11-19 10:33:59', '2012-11-19 16:07:49', '2012-11-19 21:59:01', '2012-11-19 22:24:58', '2012-11-19 22:31:59', '2012-11-20 10:21:02', '2012-11-20 10:37:51', '2012-11-20 11:14:44', '2012-11-20 13:28:35', '2012-11-20 14:40:16', '2012-11-20 15:10:16', '2012-11-21 10:06:50', '2012-11-21 15:14:47', '2012-11-21 15:30:55', '2012-11-21 17:55:48', '2012-11-22 01:45:42', '2012-11-22 10:25:45', '2012-11-22 10:59:45', '2012-11-22 11:10:30', '2012-11-22 12:09:07', '2012-11-22 15:12:19', '2012-11-22 15:26:18', '2012-11-22 16:51:54', '2012-11-22 18:17:25', '2012-11-23 00:41:13', '2012-11-23 10:28:57', '2012-11-23 10:01:57', '2012-11-23 12:57:33', '2012-11-23 14:20:47', '2012-11-23 14:38:47', '2012-11-23 16:57:43', '2012-11-23 18:06:38', '2012-11-26 10:37:28', '2012-11-26 10:05:28', '2012-11-27 10:30:43', '2012-11-27 10:44:43', '2012-11-27 11:54:59', '2012-11-27 19:46:56', '2012-11-28 09:33:52', '2012-11-28 09:44:52', '2012-11-28 12:57:42', '2012-11-28 13:38:45', '2012-11-28 19:45:20', '2012-11-28 22:18:43', '2012-11-30 11:45:40', '2012-11-30 11:51:40', '2012-11-30 15:05:54', '2012-11-30 15:20:00', '2012-11-30 17:42:59', '2012-11-30 22:15:48', '2012-11-30 22:39:48', '2011-11-28 10:42:55', '2011-11-28 10:49:00', '2011-11-28 14:41:54', '2011-11-28 15:04:00', '2011-11-28 20:20:59', '2011-11-29 12:10:37', '2011-11-29 12:19:00', '2011-11-29 13:25:32', '2011-11-29 15:22:57', '2011-11-29 15:49:00', '2011-11-29 16:32:23', '2011-11-30 10:27:58', '2011-11-30 10:38:58', '2011-11-30 13:05:31', '2011-11-30 14:55:24', '2011-11-30 15:11:00', '2011-11-30 16:41:09', '2011-11-30 15:08:00', '2011-12-01 11:19:43', '2011-12-01 11:29:43', '2011-12-01 14:36:38', '2011-12-02 12:31:10', '2011-12-02 12:37:10', '2011-12-02 14:50:19', '2011-12-02 15:24:00', '2011-12-02 19:40:50', '2011-12-05 12:17:58', '2011-12-05 12:26:02', '2011-12-05 15:13:55', '2011-12-05 
15:42:00', '2011-12-05 19:24:16', '2011-12-06 11:32:49', '2011-12-06 11:38:51', '2011-12-06 14:55:18', '2011-12-06 15:18:18', '2011-12-06 19:22:55', '2011-12-07 11:17:14', '2011-12-07 11:22:35', '2011-12-07 14:13:34', '2011-12-07 14:41:00', '2011-12-07 20:38:18', '2011-12-08 11:28:24', '2011-12-08 11:35:55', '2011-12-09 11:03:09', '2011-12-09 11:09:08')
etsF <- c(ets1, ets2, ets3)

eatData <- tibble(case_id=eci, 
                  activity=factor(c(ea1, ea2)), 
                  activity_instance_id=eaii, 
                  lifecycle_id=factor(elci), 
                  resource=factor("UNDEFINED"), 
                  timestamp=as.POSIXct(etsF)
                  )

eat_patterns <- eventlog(eatData,
    case_id = "case_id",
    activity_id = "activity",
    activity_instance_id = "activity_instance_id",
    lifecycle_id = "lifecycle_id",
    timestamp = "timestamp",
    resource_id = "resource")


# Create performance map
eat_patterns %>% process_map(type = performance(FUN = median, units = "hours"))
# Inspect variation in activity durations graphically
eat_patterns %>% processing_time(level = "activity") %>% plot()

# Draw dotted chart
eat_patterns %>% dotted_chart(x = "relative_day", sort = "start_day", units = "secs")

# Time per activity
# daily_activities %>% processing_time(level = "activity") %>% plot

# Average duration of recordings
# daily_activities %>% throughput_time(level="log", units = "hours")

# Missing activities
# daily_activities %>% idle_time(level="log", units = "hours")


# Distribution throughput time
# vacancies %>% throughput_time(units="days")

# Distribution throughput time per department
# vacancies %>% group_by(vacancy_department) %>% throughput_time(units="days") %>% plot()

# Repetitions of activities
# vacancies %>% number_of_repetitions(level = "activity") %>% arrange(-relative)

Chapter 3 - Event Data Processing

Filtering cases:

  • Sometimes there are too many cases, too many activities, missing data, and the like
    • Can filter by either cases or events (time periods or specific activity types)
    • Case filters come in three types - performance, control-flow, and time frame
  • Look at long cases for what went wrong, and short cases for what to mimic
    • filter_throughput_time(log, interval = c(5,10)) # absolute case length is 5-10 days
    • filter_throughput_time(log, percentage = 0.5) # shortest 50% of the cases
    • filter_throughput_time(log, interval = c(5,10), units = "days", reverse = TRUE) # cases that are NOT 5-10 days
    • filter_throughput_time(log, interval = c(5,NA), units = "days") # cases longer than 5 days
  • Control-flow filters can be based on activity presence/absence, timing, and the like

Filtering events - trim, frequency, label, general attribute:

  • Can trim to a time period based on start or end
    • filter_time_period(log, interval = ymd(c("20180110","20180122")), filter_method = "trim") # discards everything else
  • Can trim based on specific start and end activities
    • filter_trim(log, start_activities = "blues") # traces that have no blues will be discarded
    • filter_trim(log, start_activities = "blues", end_activities = "greens") # traces that do not have blues followed by greens will be discarded
    • Can set reverse=TRUE to get the opposites of these
  • Can filter by frequencies by either activity or resource
    • filter_activity_frequency(log, interval = c(50,100))
    • filter_activity_frequency(log, percentage = 0.8)
    • filter_resource_frequency(log, interval = c(60,900))
    • filter_resource_frequency(log, percentage = 0.6)
  • Can filter by labels
    • filter_activity(log, activities = c("reds","oranges","purples"))
    • dplyr::filter(log, cost > 1000, priority == "High", …)

Aggregating events - Is-A and Part-of:

  • An “Is-A” relationship exists when many subtypes of an activity are really all variants of one main activity
    • act_unite(log, "New name" = c("Old Variant 1","Old Variant 2","Old Variant 3"), …) # same number of activity instances, just fewer names
  • A “Part-of” relationship exists when clearly distinct activities can also be considered components of a higher-level activity
    • act_collapse(log, "Sub process" = c("Part 1","Part 2","Part 3"), …) # fewer activity instances, as they are collapsed into a single activity

Enriching events - mutation (adding calculated variables):

  • dplyr::mutate() can be used to add calculated variables, such as cost, directly
    • log %>% group_by_case() %>% mutate(total_cost = sum(cost, na.rm = TRUE)) # group_by_case() is a grouping function applied to event logs
    • log %>% group_by_case() %>% mutate(total_cost = sum(cost, na.rm = TRUE)) %>% mutate(impact = case_when(total_cost <= 1000 ~ "Low", total_cost <= 5000 ~ "Medium", TRUE ~ "High"))
    • log %>% group_by_case() %>% mutate(refund_made = any(str_detect(activity, "Pay Claim")))
  • Metric functions can be used directly, with append = TRUE, to both calculate the metric and add it to the event log
    • log %>% throughput_time(level = "case", units = "days", append = TRUE) %>% mutate(on_time = throughput_time_case <= 7)

Example code includes:

# Select top 20% of cases according to trace frequency
happy_path <- filter_trace_frequency(vacancies, percentage = 0.2)

# Visualize using process map
happy_path %>% process_map(type=frequency(value = "absolute_case"))

# Compute throughput time
happy_path %>% throughput_time(units="days")


# Find no_declines
no_declines <- filter_activity_presence(vacancies, activities = "Decline Candidate", reverse=TRUE)

# Cases that contain both "Send Offer" and "Offer Accepted"
first_hit <- filter_activity_presence(vacancies, activities = c("Send Offer", "Offer Accepted"), method="all")

# Create a performance map
first_hit %>% process_map(type=performance())

# Compute throughput time
first_hit %>% throughput_time()


# Create not_refused
not_refused <- vacancies %>% filter_precedence(antecedents = "Receive Response", consequents = "Review Non Acceptance", precedence_type = "directly_follows", filter_method = "none") 

# Select worst_cases (longest throughput times)
worst_cases <- not_refused %>% filter_throughput_time(interval=c(300, NA))

# Show the different traces
worst_cases %>% trace_explorer(coverage=1)


# Select activities
disapprovals <- vacancies %>% filter_activity(activities=c("Construct Offer", "Disapprove Offer", "Revise Offer","Disapprove Revision", "Restart Procedure"))

# Explore traces
disapprovals %>% trace_explorer(coverage=0.8)

# Performance map
disapprovals %>% process_map(type = performance(FUN = sum, units = "weeks"))


# Select cases
high_paid <- vacancies %>% filter(vacancy_department=="R&D", vacancy_salary_range==">100000")

# Most active resources
high_paid %>% resource_frequency(level="resource")

# Create a dotted chart
high_paid %>% dotted_chart(x="absolute", sort="start")

# Filtered dotted chart
library(lubridate)
high_paid %>% filter_time_period(interval = ymd(c("20180321","20180620")), filter_method = "trim") %>% dotted_chart(x="absolute", sort="start")


# Count activities and instances
n_activities(vacancies)
n_activity_instances(vacancies)

# Combine activities
united_vacancies <- vacancies %>% 
    act_unite("Disapprove Contract Offer" = c("Disapprove Offer","Disapprove Revision"),
              "Approve Contract Offer" = c("Approve Offer","Approve Revision"), 
              "Construct Contract Offer" = c("Construct Offer","Revise Offer")
              )
              
# Count activities and instances
n_activities(united_vacancies)
n_activity_instances(united_vacancies)


# Aggregate sub processes
aggregated_vacancies <- act_collapse(united_vacancies, 
                            "Interviews" = c("First Interview","Second Interview","Third Interview"),
                            "Prepare Recruitment" = c("Publish Position","File Applications","Check References"),
                            "Create Offer" = c("Construct Contract Offer", "Disapprove Contract Offer", "Approve Contract Offer")
                            )

# Calculate number of activities and activity instances
n_activities(aggregated_vacancies)
n_activity_instances(aggregated_vacancies)

# Create performance map
aggregated_vacancies %>% process_map(type=performance())


# Add total_cost
vacancies_cost <- vacancies %>% 
    group_by_case() %>% 
    mutate(total_cost = sum(activity_cost, na.rm = TRUE))

# Add cost_impact
vacancies_impact <- vacancies_cost %>%




# Compute throughput time per impact
vacancies_impact %>% group_by(cost_impact) %>% throughput_time(units = "weeks") %>% plot()


# Create cost_profile
vacancies_profile <- vacancies_impact %>%
    mutate(cost_profile = case_when(cost_impact == "High" & urgency < 7 ~ "Disproportionate",
                                    cost_impact == "Medium" & urgency < 5 ~ "Excessive",
                                    cost_impact == "Low" & urgency > 6 ~ "Lacking",
                                    TRUE ~ "Appropriate")) 

# Compare number of cases 
vacancies_profile %>% 
    group_by(cost_profile) %>%
    n_cases()
    
# Explore lacking traces
vacancies_profile %>%
  filter(cost_profile == "Lacking") %>%
  process_map()

Chapter 4 - Case Study

Preparing the event data - example includes data from Sales, Purchasing, Manufacturing, Packaging & Delivery, Accounting:

  • While all departments need to work together, it is common for each department to have different data, business rules, relational data, etc.
  • Need to create event data first prior to running anything in the bupaR package
  • Various field names (ends in _at or _by) may indicate the timing and resource levels
  • The tidyverse tools are helpful for creating the initial data

Getting to know the process:

  • Identify data sources, transform so that each row is an event, harmonize them, create an eventlog
  • Start with high-level understanding of the process - summary(otc)
    • activity_presence(otc) %>% plot()
    • trace_length(otc) %>% plot()
    • start_activities(otc, "activity") %>% plot()
    • end_activities(otc, "activity") %>% plot()

Roles and rules:

  • Parallel activities can be run in any order, which can cause an explosion in the number of traces - collapsing can help with abstraction
  • Research questions may be related to performance, compliance, etc.
  • The “4-eye” principle says that certain activities should not be performed by the same person
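
A 4-eye check can be done with plain dplyr once events are grouped by case. A minimal sketch, using an invented toy log and hypothetical column names (case_id, activity, resource):

```r
library(dplyr)

# Invented toy event log for illustration
log <- tibble::tibble(
  case_id  = c("c1", "c1", "c2", "c2"),
  activity = c("Construct Offer", "Approve Offer",
               "Construct Offer", "Approve Offer"),
  resource = c("Amy", "Sam", "Pete", "Pete")
)

# A case violates the principle when one person performed both activities
violations <- log %>%
  filter(activity %in% c("Construct Offer", "Approve Offer")) %>%
  group_by(case_id) %>%
  summarise(same_person = n_distinct(resource) < n()) %>%
  filter(same_person)

violations$case_id  # "c2": Pete both constructed and approved the offer
```

On a real bupaR event log, the same grouping is available via group_by_case().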

Fast production, fast delivery:

  • Dotted charts can show the progression of the cases - request for quotation may be declined, or the offer may only be sent (no response)
  • May want to look at the performance by stages (sub-groups of activities), for more fair comparisons

Course recap:

  • Process maps
  • Process analytics
  • Data preprocessing
  • Analysis and use cases

Example code includes:

quotations <- readRDS("./RInputFiles/otc_quotations.RDS")

# Inspect quotations
str(quotations)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1833 obs. of  17 variables:
##  $ quotation_id            : chr  "quo-1003" "quo-1004" "quo-1006" "quo-1008" ...
##  $ cancelled_at            : chr  "2017-05-22 13:28:04" NA NA NA ...
##  $ cancelled_by            : Factor w/ 20 levels "Amy","Andrea",..: 10 NA NA NA 8 NA NA NA NA NA ...
##  $ manufactContacted_at    : chr  "2017-04-22 17:58:11" "2017-06-18 13:47:50" "2017-10-28 13:55:51" NA ...
##  $ manufactContacted_by    : Factor w/ 20 levels "Amy","Andrea",..: 11 11 11 NA NA NA 11 14 NA NA ...
##  $ received_at             : chr  "2017-04-16 20:34:12" "2017-06-09 11:19:31" "2017-10-14 18:55:47" "2017-09-08 13:29:05" ...
##  $ received_by             : Factor w/ 20 levels "Amy","Andrea",..: 2 8 8 8 8 8 10 8 2 2 ...
##  $ reminded_at             : chr  "2017-05-14 19:06:41" NA NA NA ...
##  $ reminded_by             : Factor w/ 20 levels "Amy","Andrea",..: 8 NA NA NA 8 NA 8 8 NA NA ...
##  $ send_at                 : chr  "2017-05-08 14:20:30" "2017-07-02 18:50:58" "2017-11-09 11:27:11" NA ...
##  $ send_by                 : Factor w/ 20 levels "Amy","Andrea",..: 10 2 2 NA 2 NA 2 2 NA 2 ...
##  $ supplierContacted_at    : chr  "2017-04-29 13:43:18" "2017-06-20 12:19:31" "2017-10-26 18:06:29" NA ...
##  $ supplierContacted_by    : Factor w/ 20 levels "Amy","Andrea",..: 14 11 11 NA 11 NA 11 14 NA 14 ...
##  $ supplierOfferReceived_at: chr  "2017-05-03 19:09:21" "2017-06-23 19:33:10" "2017-10-30 10:36:44" NA ...
##  $ supplierOfferReceived_by: Factor w/ 20 levels "Amy","Andrea",..: 14 11 14 NA 14 NA 14 14 NA 14 ...
##  $ warehouseContacted_at   : chr  "2017-04-24 19:36:10" "2017-06-15 19:30:07" "2017-10-22 17:57:26" NA ...
##  $ warehouseContacted_by   : Factor w/ 20 levels "Amy","Andrea",..: 11 11 11 NA 14 NA 11 14 NA 14 ...
# Create offer_history
offer_history <- quotations %>%
    gather(key, value, -quotation_id) %>%
    separate(key, into = c("activity", "info"))
## Warning: attributes are not identical across measure variables;
## they will be dropped
# Recode the key variable
offer_history <- offer_history %>%
    mutate(info = fct_recode(info,  "timestamp" = 'at',  "resource" = 'by'))

# Spread the info variable
offer_history <- offer_history %>%
    spread(info, value)


validations <- readRDS("./RInputFiles/otc_validations.RDS")

# Inspect validations
str(validations)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1833 obs. of  4 variables:
##  $ quotation_id: chr  "quo-1003" "quo-1004" "quo-1006" "quo-1008" ...
##  $ resource    : chr  "Jonathan" "Andrea" "Katherine" "Andrea" ...
##  $ started     : chr  "2017-04-17 14:59:08" "2017-06-11 13:10:45" "2017-10-16 15:59:18" "2017-09-09 17:58:39" ...
##  $ completed   : chr  "2017-04-19 18:32:57" "2017-06-13 12:18:57" "2017-10-18 16:21:56" "2017-09-12 20:58:14" ...
# Create validate_history
validate_history <- validations %>%
    mutate(
        activity = "Validate",
        action = paste(quotation_id, "validate",  sep = "-"))

# Gather the timestamp columns
validate_history <- validate_history  %>%
    gather(lifecycle, timestamp, started, completed)


# Recode the lifecycle column of validate_history
validate_history <- validate_history %>%
    mutate(lifecycle = fct_recode(lifecycle,
                "start" = "started",
                "complete" = "completed"))


# Add lifecycle and action column to offer_history
offer_history <- offer_history %>%
    mutate(
        lifecycle = "complete",
        action = paste(quotation_id, 1:n(), sep = "-"))

# Create sales_history
sales_history <- bind_rows(validate_history, offer_history)
## Warning in bind_rows_(x, .id): binding factor and character vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
sales_history <- readRDS("./RInputFiles/otc_sales_history.RDS")
order_history <- readRDS("./RInputFiles/otc_order_history.RDS")
# sales_quotations <- readRDS("./RInputFiles/otc_sales_quotation.RDS")

str(sales_history)
## Classes 'tbl_df', 'tbl' and 'data.frame':    14695 obs. of  7 variables:
##  $ quotation_id  : chr  "quo-1003" "quo-1004" "quo-1006" "quo-1008" ...
##  $ resource      : chr  "Jonathan" "Andrea" "Katherine" "Andrea" ...
##  $ activity      : chr  "Validate" "Validate" "Validate" "Validate" ...
##  $ action        : chr  "quo-1003-validate" "quo-1004-validate" "quo-1006-validate" "quo-1008-validate" ...
##  $ lifecycle     : chr  "start" "start" "start" "start" ...
##  $ timestamp     : chr  "2017-04-17 14:59:08" "2017-06-11 13:10:45" "2017-10-16 15:59:18" "2017-09-09 17:58:39" ...
##  $ sales_order_id: chr  NA "order-17-56548" "order-17-56550" NA ...
str(order_history)
## Classes 'tbl_df', 'tbl' and 'data.frame':    60804 obs. of  8 variables:
##  $ sales_order_id: chr  "order-17-56542" "order-17-56542" "order-17-56543" "order-17-56543" ...
##  $ action        : chr  "order-17-56542-0000001" "order-17-56542-0000002" "order-17-56543-0000003" "order-17-56543-0000004" ...
##  $ activity      : Factor w/ 37 levels "Assemble Order",..: 24 35 24 35 24 35 24 35 24 35 ...
##  $ resource      : Factor w/ 20 levels "Amy","Andrea",..: 10 8 2 8 2 8 10 8 2 8 ...
##  $ status        : Factor w/ 2 levels "complete","start": 2 2 2 2 2 2 2 2 2 2 ...
##  $ time          : POSIXct, format: "2017-10-17 12:37:22" "2017-10-19 15:30:40" ...
##  $ activity_cost : num  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
##  $ quotation_id  : chr  NA NA NA NA ...
# str(sales_quotations)

order_history <- order_history %>% 
    rename(timestamp=time, lifecycle=status) %>%
    select(-activity_cost) %>%
    mutate(activity=as.character(activity), 
           resource=as.character(resource), 
           lifecycle=as.character(lifecycle)
           )
sales_history <- sales_history %>%
    mutate(timestamp=lubridate::as_datetime(timestamp))

# sales_history <- sales_history %>% left_join(sales_quotations)
otc <- bind_rows(sales_history, order_history)


# Create the eventlog object 
otc <- otc %>%
    mutate(case_id = paste(quotation_id, sales_order_id, sep = "-")) %>%
    eventlog(
        case_id = "case_id",
        activity_id = "activity",
        activity_instance_id = "action",
        timestamp = "timestamp",
        resource_id = "resource",
        lifecycle_id = "lifecycle"
        )

# Create trace coverage graph
trace_coverage(otc, level="trace") %>% plot()

# Explore traces
otc %>%
    trace_explorer(coverage = 0.25)

# Collapse activities
otc_high_level <- act_collapse(otc, "Delivery" = c(
  "Handover To Deliverer",
  "Order Delivered",
  "Present For Collection",
  "Order Fetched")
  )

# Draw a process map
process_map(otc_high_level)
# Redraw the trace coverage graph
otc_high_level %>% trace_coverage(level="trace") %>% plot()

# Compute activity wise processing time
otc_high_level %>% processing_time(level="activity", units="days")
## # A tibble: 34 x 11
##    activity          min    q1  mean median    q3   max st_dev   iqr total
##    <fct>           <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>
##  1 Packaging       0      0     0      0     0      0     0     0        0
##  2 Prepare Invoice 0      0     0      0     0      0     0     0        0
##  3 Produce Order   0      0     0      0     0      0     0     0        0
##  4 Quality Control 0      0     0      0     0      0     0     0        0
##  5 Assemble Order  0      0     0      0     0      0     0     0        0
##  6 Delivery        0.583  1.99  5.11   3.11  8.06  17.0   3.86  6.07 15452
##  7 Order Materials 0      0     0      0     0      0     0     0        0
##  8 Receive Materi~ 0      0     0      0     0      0     0     0        0
##  9 Receive Sales ~ 0      0     0      0     0      0     0     0        0
## 10 Schedule Job    0      0     0      0     0      0     0     0        0
## # ... with 24 more rows, and 1 more variable: relative_frequency <dbl>
# Plot a resource activity matrix of otc
otc %>% resource_frequency(level = "resource-activity") %>% plot()

# Create otc_selection
otc_selection <- otc %>% filter_activity(activities = c("Send Quotation","Send Invoice"))

# Explore traces
otc %>% trace_explorer(coverage=1)

# Draw a resource map
otc_selection %>% resource_map()
# Create otc_returned
otc_returned <- otc %>% filter_activity_presence("Return Goods")

# Compute percentage of returned orders
n_cases(otc_returned)/n_cases(otc)
## [1] 0.2130923
# Trim cases and visualize
otc_returned %>% filter_trim(start_activities="Return Goods") %>% process_map()
# Time from order to delivery
# otc %>% filter_trim(start_activities="Receive Sales Order", end_activities="Order Delivered") %>% 
#     processing_time(units="days")


# Plot processing time by type
# otc %>%
#     group_by(type) %>%
#     throughput_time() %>%
#     plot()

Network Science in R - A Tidy Approach

Chapter 1 - Hubs of the Network

Network science - include social networks, neural networks, etc.:

  • Nodes and edges (connections between nodes, aka “ties”) make up a network
    • In a directed network, ties have a direction (for example, followers and following)
    • In an undirected network, ties do not have a direction (for example, mutual friendship)
    • In a weighted network, the ties have an associated weight (such as bandwidth, duration of friendship, etc.)
  • The chapter focuses on the terrorist network associated with the Madrid train bombing of 2004
    • Ties include friendship, training camps, previous attacks, and other connections
  • The network is reflected in tidy fashion, using one data frame for nodes and another for ties
    • g <- igraph::graph_from_data_frame(d = ties, directed = FALSE, vertices = nodes)
    • V(g); vcount(g)
    • E(g); ecount(g)
  • And, then working with attributes of the network
    • g\(name <- "Madrid network"; g\)name
    • V(g)$id <- 1:vcount(g)
    • E(g)$weight
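
The directed/undirected/weighted distinctions above can be sketched with a tiny hypothetical edge list (three made-up nodes, not the course data):

```r
# toy example (hypothetical data): a weighted network built from an edge list
library(igraph)

toy_ties <- data.frame(
  from   = c("A", "A", "B"),
  to     = c("B", "C", "C"),
  weight = c(1, 2, 3)
)

# directed: ties run from `from` to `to` (like follower relationships)
g_dir <- graph_from_data_frame(toy_ties, directed = TRUE)
is_directed(g_dir)   # TRUE
is_weighted(g_dir)   # TRUE - the weight column became a tie attribute

# undirected: the same ties, but with no direction (like mutual friendship)
g_undir <- graph_from_data_frame(toy_ties, directed = FALSE)
is_directed(g_undir) # FALSE
```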

Visualizing networks:

  • The ggraph package can help with visualizing networks
    • ggraph(g, layout = "with_kk") + geom_edge_link(aes(alpha = weight)) + geom_node_point()
    • Much like the language of ggplot2

Centrality measures:

  • Objective is to find the most important nodes - connections among members of the networks
  • Network science is a spinoff of data science, with the goal of measuring networks
  • The measure of “degree” counts the number of ties (edges) that a node has
    • degree(g) # gives the number of edges per node
    • strength(g) # sums the weights of the edges per node

Example code includes:

# read the nodes file into the variable nodes
nodes <- readr::read_csv("./RInputFiles/nodes.csv")
nodes

# read the ties file into the variable ties
ties <- readr::read_csv("./RInputFiles/ties.csv")
ties


library(igraph)
library(ggraph)


# make the network from the data frame ties and print it
g <- graph_from_data_frame(ties, directed = FALSE, vertices = nodes)
g

# explore the set of nodes
V(g)

# print the number of nodes
vcount(g)

# explore the set of ties
E(g)

# print the number of ties
ecount(g)


# give the name "Madrid network" to the network and print the network `name` attribute
g$name <- "Madrid network"
g$name

# add node attribute id and print the node `id` attribute
V(g)$id <- 1:vcount(g)
V(g)$id

# print the tie `weight` attribute
E(g)$weight

# print the network and spot the attributes
g


# visualize the network with layout Kamada-Kawai
ggraph(g, layout = "with_kk") + 
  geom_edge_link(aes(alpha = weight)) + 
  geom_node_point()


# add an id label to nodes
ggraph(g, layout = "with_kk") +
  geom_edge_link(aes(alpha = weight)) +
  geom_node_point()  + 
  geom_node_text(aes(label = id), repel=TRUE)


# visualize the network with circular layout. Set tie transparency proportional to its weight
ggraph(g, layout = "in_circle") + 
  geom_edge_link(aes(alpha = weight)) + 
  geom_node_point()


# visualize the network with grid layout. Set tie transparency proportional to its weight
ggraph(g, layout = "grid") + 
  geom_edge_link(aes(alpha = weight)) + 
  geom_node_point()


# compute the degrees of the nodes
dgr <- degree(g)

# add the degrees to the data frame object
nodes <- mutate(nodes, degree = dgr)

# add the degrees to the network object
V(g)$degree <- dgr

# arrange the terrorists in decreasing order of degree
arrange(nodes, -degree)


# compute node strengths
stg <- strength(g)

# add strength to the data frame object using mutate
nodes <- mutate(nodes, strength = stg)

# add the variable stg to the network object as strength
V(g)$strength <- stg

# arrange terrorists in decreasing order of strength and then in decreasing order of degree
arrange(nodes, -strength, -degree)

Chapter 2 - Weakness and strength

Tie betweenness:

  • Betweenness is the number of shortest paths that go through a specific tie (edge) - removing these ties would be the most disruptive
  • In a weighted network, the shortest path is defined as the lowest sum of weights, rather than the fewest edges
    • Often need to invert the weights prior to running, since a “high” weight usually means a close connection and thus an easy path
    • dist_weight = 1 / E(g)$weight
    • edge_betweenness(g, weights = dist_weight)

Visualizing centrality measures:

  • Visualizing betweenness can be done with the ggraph package
    • ggraph(g, layout = "with_kk") + geom_edge_link(aes(alpha = betweenness)) + geom_node_point()
    • ggraph(g, layout = "with_kk") + geom_edge_link(aes(alpha = weight)) + geom_node_point(aes(size = degree))

The strength of weak ties:

  • “The Strength of Weak Ties” is a classic research paper (Granovetter, 1973) about tie strength in networks
    • Argument is that the “weak ties” in a network are often the most important - relationships between diverse communities, leading to diverse ideas
    • The “strong ties” are the relationships between people who are frequently together - can lead to group-think and stasis
    • Noted that the Madrid group (and similar groups) tended to be highly dispersed and thus to have many weak ties
    • ties %>% group_by(weight) %>% summarise(n = n(), p = n / nrow(ties)) %>% arrange(-n)

Example code includes:

# save the inverse of tie weights as dist_weight
dist_weight <- 1 / E(g)$weight

# compute weighted tie betweenness
btw <- edge_betweenness(g, weights = dist_weight)

# mutate the data frame ties adding a variable betweenness using btw
ties <- mutate(ties, betweenness=btw)

# add the tie attribute betweenness to the network
E(g)$betweenness <- btw


# join ties with nodes
ties_joined <- ties %>% 
  left_join(nodes, c("from" = "id")) %>% 
  left_join(nodes, c("to" = "id")) 

# select only relevant variables and save to ties
ties_selected <- ties_joined %>% 
  select(from, to, name_from = name.x, name_to = name.y, betweenness)

# arrange named ties in decreasing order of betweenness
arrange(ties_selected, -betweenness)


# set (alpha) proportional to weight and node size proportional to degree
ggraph(g, layout = "with_kk") + 
  geom_edge_link(aes(alpha=weight)) + 
  geom_node_point(aes(size=degree))

# produce the same visualization but set node size proportional to strength
ggraph(g, layout = "with_kk") + 
  geom_edge_link(aes(alpha = weight)) + 
  geom_node_point(aes(size = strength))


# visualize the network with tie transparency proportional to betweenness
ggraph(g, layout = "with_kk") + 
  geom_edge_link(aes(alpha = betweenness)) + 
  geom_node_point()

# add node size proportional to degree
ggraph(g, layout = "with_kk") + 
  geom_edge_link(aes(alpha = betweenness)) + 
  geom_node_point(aes(size = degree))


# find median betweenness
q <- median(E(g)$betweenness)

# filter ties with betweenness larger than the median
ggraph(g, layout = "with_kk") + 
  geom_edge_link(aes(alpha = betweenness, filter = (betweenness > q))) + 
  geom_node_point() + 
  theme(legend.position="none")


# find number and percentage of weak ties
ties %>%
  group_by(weight) %>%
  summarise(number = n(), percentage=n()/nrow(.)) %>%
  arrange(-number)


# build vector weakness containing TRUE for weak ties
weakness <- ifelse(ties$weight == 1, TRUE, FALSE)

# check that weakness contains the correct number of weak ties
sum(weakness)


# visualize the network by coloring the weak and strong ties
ggraph(g, layout = "with_kk") + 
  geom_edge_link(aes(color = weakness)) + 
  geom_node_point()


# visualize the network with only weak ties using the filter aesthetic
ggraph(g, layout = "with_kk") + 
  geom_edge_link(aes(filter=weakness), alpha = 0.5) + 
  geom_node_point()

Chapter 3 - Connection patterns

Connection patterns:

  • The adjacency matrix can be calculated using as_adjacency_matrix(g)
    • For each match of row/column, there will be a 1 for adjacency and a 0 for non-adjacency
    • Alternately, can have the weight of the tie as the entry for each row/column (with 0 as before meaning non-adjacency)
    • A <- as_adjacency_matrix(g, attr = "weight")
    • diag(A)
  • Can use the adjacency matrix to assess similarity of nodes in the matrix
    • The Pearson similarity measures the correlation between the columns in the matrix

Pearson correlation coefficient:

  • Can visualize the correlations using scatterplots
  • Can compute the correlations analytically as well
    • cor(nodes$degree, nodes$strength)

Most similar and most dissimilar terrorists:

  • Can use named graphs with weighted ties for a graphical representation of nodes and paths
  • Can use the adjacency matrix to represent the ties in a manner simplified for algebra
  • Can use the data frame format (one for nodes, and one for ties) for use with dplyr and ggplot2
    • as_data_frame(g, what = "both")
  • Can easily switch back and forth between the representations of the network
    • as_adjacency_matrix(g)
    • graph_from_adjacency_matrix(A)
    • as_data_frame(g, what = "both")
    • graph_from_data_frame(df$ties, vertices = df$nodes)
    • as_data_frame(graph_from_adjacency_matrix(A), what = "both")
    • as_adjacency_matrix(graph_from_data_frame(df$ties, vertices = df$nodes))

Example code includes:

# mutate ties data frame by swapping variables from and to 
ties_mutated <- mutate(ties, temp = to, to = from, from = temp) %>% select(-temp)

# append ties_mutated data frame to ties data frame
ties <- rbind(ties, ties_mutated)

# use a scatter plot to visualize node connection patterns in ties setting color aesthetic to weight
ggplot(ties, aes(x = from, y = to, color = factor(weight))) +
  geom_point() +
  labs(color = "weight")


# get the weighted adjacency matrix
A <- as_adjacency_matrix(g, attr = "weight", sparse = FALSE, names = FALSE)

# print the first row and first column of A
A[1, ]
A[, 1]

# print submatrix of the first 6 rows and columns
A[1:6, 1:6]


# obtain a vector of node strengths
rowSums(A)

# build a Boolean (0/1) matrix from the weighted matrix A
B <- ifelse(A > 0, 1, 0)

# obtain a vector of node degrees using the Boolean matrix
rowSums(B)


# compute the Pearson correlation on columns of A
S <- cor(A)

# set the diagonal of S to 0
diag(S) = 0

# print a summary of the similarities in matrix S
summary(c(S))

# plot a histogram of similarities in matrix S
hist(c(S), xlab = "Similarity", main = "Histogram of similarity")


# Scatter plot of degree and strength with regression line
ggplot(nodes, aes(x = degree, y = strength)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Pearson correlation coefficient 
cor(nodes$degree, nodes$strength)


# build weighted similarity network and save to h
h <- graph_from_adjacency_matrix(S, mode = "undirected", weighted = TRUE)

# convert the similarity network h into a similarity data frame sim_df
sim_df <- as_data_frame(h, what = "edges")

# map the similarity data frame to a tibble and save it as sim_tib
sim_tib <- as_tibble(sim_df)

# print sim_tib
sim_tib


# left join similarity and nodes data frames and then select and rename relevant variables
sim2 <- sim_tib %>% 
  left_join(nodes, c("from" = "id")) %>% 
  left_join(nodes, c("to" = "id")) %>%
  select(from, to, name_from = name.x, name_to = name.y, similarity = weight, 
         degree_from = degree.x, degree_to = degree.y, strength_from = strength.x, strength_to = strength.y)
  
# print sim2
sim2


# arrange sim2 in decreasing order of similarity. 
sim2 %>% arrange(-similarity)

# filter sim2, allowing only pairs with a degree of least 10, arrange the result in decreasing order of similarity
sim2 %>%
  filter(degree_from >= 10, degree_to >= 10) %>%
  arrange(-similarity)

# Repeat the previous steps, but in increasing order of similarity
sim2 %>%
  filter(degree_from >= 10, degree_to >= 10) %>%
  arrange(similarity)


# filter the similarity data frame to similarities larger than or equal to 0.60
sim3 <- filter(sim2, similarity >= 0.6)

# build a similarity network called h2 from the filtered similarity data frame
h2 <- graph_from_data_frame(sim3, directed = FALSE)

# visualize the similarity network h2
ggraph(h2, layout = "with_kk") + 
  geom_edge_link(aes(alpha = similarity)) + 
  geom_node_point()

Chapter 4 - Similarity Clusters

Hierarchical clustering - find clusters of similar people:

  • Basic idea is to define a measure of similarity, then repeatedly merge the most similar entities into groups, proceeding until there is a single cluster containing everyone
  • The dendrogram (tree diagram) is helpful for viewing this data
  • The similarity measure between individual nodes (person similarity) exists, and needs to be extended to groups
    • Single-linkage - similarity is the maximum of the similarities of anyone in the groups
    • Complete-linkage - similarity is the minimum of the similarities of anyone in the groups
    • Average-linkage - similarity is the average of the similarities of everyone in the groups
  • The clustering algorithm works as follows
    • Evaluate similarity for all node pairs
    • Assign each node to its own group
    • Find the pair of groups with the highest similarity, and join them
    • Calculate the similarity of this newly formed group to all previously existing entities (groups or individuals)
    • Repeat until there is just a single cluster remaining
  • The R implementation is hclust()
    • D <- 1 - S
    • d <- as.dist(D)
    • cc <- hclust(d, method = "average")
    • cls <- cutree(cc, k = 4)
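
The three linkage rules can be compared directly with hclust(); this is a small sketch on a hypothetical 4 x 4 similarity matrix (made-up values, not the Madrid data):

```r
# hypothetical similarity matrix for four nodes: {1,2} and {3,4} are similar pairs
S_toy <- matrix(c(1.0, 0.9, 0.2, 0.1,
                  0.9, 1.0, 0.3, 0.2,
                  0.2, 0.3, 1.0, 0.8,
                  0.1, 0.2, 0.8, 1.0), nrow = 4)

# convert similarity to distance, as in the course
d_toy <- as.dist(1 - S_toy)

# "single" merges on maximum similarity (minimum distance) between groups,
# "complete" on minimum similarity, "average" on the mean similarity
hc_single   <- hclust(d_toy, method = "single")
hc_complete <- hclust(d_toy, method = "complete")
hc_average  <- hclust(d_toy, method = "average")

# cutting the average-linkage tree at k = 2 recovers the two pairs: {1,2} and {3,4}
cutree(hc_average, k = 2)
```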

Interactive visualizations with visNetwork:

  • visNetwork is an interactive package for viewing networks
    • Many different layouts are available, and you can interact with the nodes and the ties
    • Can select nodes and see their neighborhoods (nodes within a certain distance)
    • Can select nodes by name
    • Can partition nodes into groups and color, highlight, etc.

Wrap up:

  • Analysis of networks with measures of centrality and similarity
  • Visualization of networks, including interactivity

Example code includes:

# compute a distance matrix
D <- 1 - S

# obtain a distance object 
d <- as.dist(D)

# run average-linkage clustering method and plot the dendrogram 
cc <- hclust(d, method = "average")
plot(cc)

# find the similarity of the first pair of nodes that have been merged 
S[40, 45]


# cut the dendrogram at 4 clusters
cls <- cutree(cc, k = 4)

# add cluster information to the nodes data frame
nodes <- mutate(nodes, cluster = cls)

# print the nodes data frame
nodes


# output the names of terrorists in the first cluster
filter(nodes, cluster == 1) %>% 
    select(name)

# for each cluster, select the size of the cluster, the average node degree, and the average node strength, and sort by cluster size
group_by(nodes, cluster) %>%
  summarise(size = n(), 
            avg_degree = mean(degree),
            avg_strength = mean(strength)
            ) %>%
  arrange(-size)


# add cluster information to the network 
V(g)$cluster <- nodes$cluster

# visualize the original network with colored clusters
ggraph(g, layout = "with_kk") + 
  geom_edge_link(aes(alpha = weight), show.legend=FALSE) + 
  geom_node_point(aes(color = factor(cluster))) +
  labs(color = "cluster")

# facet the network with respect to cluster attribute
ggraph(g, layout = "with_kk") + 
  geom_edge_link(aes(alpha = weight), show.legend=FALSE) + 
  geom_node_point(aes(color = factor(cluster))) +
  facet_nodes(~cluster, scales="free")  +
  labs(color = "cluster")


# convert igraph to visNetwork
data <- visNetwork::toVisNetworkData(g)

# print head of nodes and ties
head(data$nodes)
head(data$edges)

# visualize the network
visNetwork::visNetwork(nodes = data$nodes, edges = data$edges, width = 300, height = 300)


# use the Kamada-Kawai layout
visNetwork::visNetwork(nodes = data$nodes, edges = data$edges, width = 300, height = 300) %>%
  visNetwork::visIgraphLayout(layout = "layout_with_kk")

# use the circle layout
visNetwork::visNetwork(nodes = data$nodes, edges = data$edges, width = 300, height = 300) %>%
  visNetwork::visIgraphLayout(layout = "layout_in_circle")

# use the grid layout
visNetwork::visNetwork(nodes = data$nodes, edges = data$edges, width = 300, height = 300) %>%
  visNetwork::visIgraphLayout(layout = "layout_on_grid")


# highlight nearest nodes and ties of the selected node
visNetwork::visNetwork(nodes = data$nodes, edges = data$edges, width = 300, height = 300) %>%
  visNetwork::visIgraphLayout(layout = "layout_with_kk") %>%
  visNetwork::visOptions(highlightNearest = TRUE) 


# select nodes by id 
visNetwork::visNetwork(nodes = data$nodes, edges = data$edges, width = 300, height = 300) %>%
  visNetwork::visIgraphLayout(layout = "layout_with_kk") %>%
  visNetwork::visOptions(nodesIdSelection = TRUE)

# set color to cluster and generate network data
V(g)$color = V(g)$cluster
data <- visNetwork::toVisNetworkData(g)

# select by group (cluster)
visNetwork::visNetwork(nodes = data$nodes, edges = data$edges, width = 300, height = 300) %>%
  visNetwork::visIgraphLayout(layout = "layout_with_kk") %>%
  visNetwork::visOptions(selectedBy = "group")

Data Privacy and Anonymization in R

Chapter 1 - Introduction to Data Privacy

Intro to Anonymization - Part I:

  • Need to implement better data privacy techniques - e.g., census data, healthcare data, etc.
  • Need to release data, such as individual-level health records, in a manner that does not identify specific individuals
  • Topics covered in this course will include
    • Remove identifiers, synthesize data
    • Laplace mechanism for adding noise to query answers
    • Differential privacy and post-processing
    • Release of data using the above techniques
  • Data sets will include White House salaries and male infertility data
    • One basic technique is removing identifiers, such as replacing names with numbers
    • Another basic technique is to round continuous values (such as to the nearest 1000)

Intro to Anonymization - Part II:

  • Additional approaches include generalization and top/bottom coding
    • Generalization creates larger buckets of data
    • Top/bottom coding sets outliers back to a pre-defined top and bottom of the range
  • Additional dplyr functions of interest
    • count() is used to find the number of observations for each distinct group
    • whitehouse %>% count(Status)
    • whitehouse %>% count(Status, Title, sort = TRUE) # sort=TRUE sorts by descending n
    • summarize_at() lets you get summary statistics for a key variable
    • whitehouse %>% summarise_at(vars(Salary), sum) # vars() holds the bare variables, while sum is the requested function
    • whitehouse %>% summarise_at(vars(Salary), funs(mean, sd)) # funs() holds the list of functions that you want to apply

Data Synthesis:

  • Fake datasets created based on sampling from a probability distribution
  • Goal is a fake dataset (by definition anonymized) that is statistically similar to the real dataset
    • For 1/0 data, sampling from the binomial distribution can work well
    • For bell-shaped data, the normal or log-normal can often work well (though there can be issues with bounding)
    • Hard-bounding is setting values to a proper max/min, while another approach is to discard the record and sample again
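
The discard-and-resample alternative to hard bounding can be sketched as a simple rejection loop (a hypothetical sketch; the mean, standard deviation, and [0, 200000] bounds are assumptions, not values from the course data):

```r
set.seed(42)

# rejection sampling: draw from a normal model, discard out-of-range values
# and draw again, rather than clamping them to the bounds
sample_bounded <- function(n, mean, sd, lower, upper) {
  out <- numeric(0)
  while (length(out) < n) {
    draws <- rnorm(n, mean, sd)
    out <- c(out, draws[draws >= lower & draws <= upper])
  }
  out[1:n]
}

# hypothetical salary model with 469 synthetic records
synthetic_salary <- sample_bounded(469, mean = 85000, sd = 40000,
                                   lower = 0, upper = 200000)
range(synthetic_salary)  # always inside the bounds
```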

Example code includes:

load("./RInputFiles/dataPriv.RData")


# Preview data
whitehouse
## # A tibble: 469 x 5
##    Name                 Status   Salary Basis     Title                   
##    <chr>                <chr>     <dbl> <chr>     <chr>                   
##  1 Abrams, Adam W.      Employee  66300 Per Annum WESTERN REGIONAL COMMUN~
##  2 Adams, Ian H.        Employee  45000 Per Annum EXECUTIVE ASSISTANT TO ~
##  3 Agnew, David P.      Employee  93840 Per Annum DEPUTY DIRECTOR OF INTE~
##  4 Albino, James        Employee  91800 Per Annum SENIOR PROGRAM MANAGER  
##  5 Aldy, Jr., Joseph E. Employee 130500 Per Annum SPECIAL ASSISTANT TO TH~
##  6 Alley, Hilary J.     Employee  42000 Per Annum STAFF ASSISTANT         
##  7 Amorsingh, Lucius L. Employee  56092 Per Annum SPECIAL ASSISTANT       
##  8 Anderson, Amanda D.  Employee  60000 Per Annum SPECIAL ASSISTANT TO TH~
##  9 Anderson, Charles D. Employee  51000 Per Annum POLICY ASSISTANT        
## 10 Andrias, Kate E.     Employee 130500 Per Annum SPECIAL ASSISTANT TO TH~
## # ... with 459 more rows
# Set seed
set.seed(42)

# Replace names with random numbers from 1 to 1000
whitehouse_no_names <- whitehouse %>%
    mutate(Name = sample(1:1000, nrow(.), replace=FALSE))

whitehouse_no_names
## # A tibble: 469 x 5
##     Name Status   Salary Basis     Title                                  
##    <int> <chr>     <dbl> <chr>     <chr>                                  
##  1   915 Employee  66300 Per Annum WESTERN REGIONAL COMMUNICATIONS DIRECT~
##  2   937 Employee  45000 Per Annum EXECUTIVE ASSISTANT TO THE DIRECTOR OF~
##  3   286 Employee  93840 Per Annum DEPUTY DIRECTOR OF INTERGOVERNMENTAL A~
##  4   828 Employee  91800 Per Annum SENIOR PROGRAM MANAGER                 
##  5   640 Employee 130500 Per Annum SPECIAL ASSISTANT TO THE PRESIDENT FOR~
##  6   517 Employee  42000 Per Annum STAFF ASSISTANT                        
##  7   733 Employee  56092 Per Annum SPECIAL ASSISTANT                      
##  8   134 Employee  60000 Per Annum SPECIAL ASSISTANT TO THE CHIEF OF STAFF
##  9   652 Employee  51000 Per Annum POLICY ASSISTANT                       
## 10   699 Employee 130500 Per Annum SPECIAL ASSISTANT TO THE PRESIDENT AND~
## # ... with 459 more rows
# Rounding Salary to the nearest ten thousand
whitehouse_no_identifiers <- whitehouse_no_names %>%
    mutate(Salary = round(Salary, -4))

whitehouse_no_identifiers
## # A tibble: 469 x 5
##     Name Status   Salary Basis     Title                                  
##    <int> <chr>     <dbl> <chr>     <chr>                                  
##  1   915 Employee  70000 Per Annum WESTERN REGIONAL COMMUNICATIONS DIRECT~
##  2   937 Employee  40000 Per Annum EXECUTIVE ASSISTANT TO THE DIRECTOR OF~
##  3   286 Employee  90000 Per Annum DEPUTY DIRECTOR OF INTERGOVERNMENTAL A~
##  4   828 Employee  90000 Per Annum SENIOR PROGRAM MANAGER                 
##  5   640 Employee 130000 Per Annum SPECIAL ASSISTANT TO THE PRESIDENT FOR~
##  6   517 Employee  40000 Per Annum STAFF ASSISTANT                        
##  7   733 Employee  60000 Per Annum SPECIAL ASSISTANT                      
##  8   134 Employee  60000 Per Annum SPECIAL ASSISTANT TO THE CHIEF OF STAFF
##  9   652 Employee  50000 Per Annum POLICY ASSISTANT                       
## 10   699 Employee 130000 Per Annum SPECIAL ASSISTANT TO THE PRESIDENT AND~
## # ... with 459 more rows
# Convert the salaries into three categories
whitehouse.gen <- whitehouse %>%
    mutate(Salary = ifelse(Salary < 50000, 0, 
                           ifelse(Salary >= 50000 & Salary < 100000, 1, 2)))

whitehouse.gen
## # A tibble: 469 x 5
##    Name                 Status   Salary Basis     Title                   
##    <chr>                <chr>     <dbl> <chr>     <chr>                   
##  1 Abrams, Adam W.      Employee   1.00 Per Annum WESTERN REGIONAL COMMUN~
##  2 Adams, Ian H.        Employee   0    Per Annum EXECUTIVE ASSISTANT TO ~
##  3 Agnew, David P.      Employee   1.00 Per Annum DEPUTY DIRECTOR OF INTE~
##  4 Albino, James        Employee   1.00 Per Annum SENIOR PROGRAM MANAGER  
##  5 Aldy, Jr., Joseph E. Employee   2.00 Per Annum SPECIAL ASSISTANT TO TH~
##  6 Alley, Hilary J.     Employee   0    Per Annum STAFF ASSISTANT         
##  7 Amorsingh, Lucius L. Employee   1.00 Per Annum SPECIAL ASSISTANT       
##  8 Anderson, Amanda D.  Employee   1.00 Per Annum SPECIAL ASSISTANT TO TH~
##  9 Anderson, Charles D. Employee   1.00 Per Annum POLICY ASSISTANT        
## 10 Andrias, Kate E.     Employee   2.00 Per Annum SPECIAL ASSISTANT TO TH~
## # ... with 459 more rows
# Bottom Coding
whitehouse.bottom <- whitehouse %>%
    mutate(Salary = pmax(Salary, 45000))

# Filter Results
whitehouse.bottom %>%
    filter(Salary <= 45000)
## # A tibble: 109 x 5
##    Name                Status   Salary Basis     Title                    
##    <chr>               <chr>     <dbl> <chr>     <chr>                    
##  1 Adams, Ian H.       Employee  45000 Per Annum EXECUTIVE ASSISTANT TO T~
##  2 Alley, Hilary J.    Employee  45000 Per Annum STAFF ASSISTANT          
##  3 Asen, Jonathan D.   Employee  45000 Per Annum SENIOR ANALYST           
##  4 Ayling, Lindsay A.  Employee  45000 Per Annum ANALYST                  
##  5 Baggetto, Maude L.  Employee  45000 Per Annum STAFF ASSISTANT          
##  6 Bates, Andrew J.    Employee  45000 Per Annum MEDIA MONITOR            
##  7 Belive, Lauren E.   Employee  45000 Per Annum LEGISLATIVE ASSISTANT AN~
##  8 Bisi, Rachel I.     Employee  45000 Per Annum LEGISLATIVE ASSISTANT    
##  9 Block, Michael R.   Employee  45000 Per Annum STAFF ASSISTANT          
## 10 Blount, Patricia H. Employee  45000 Per Annum RECORDS MANAGEMENT ANALY~
## # ... with 99 more rows
# View fertility data
fertility
## # A tibble: 100 x 10
##    Season   Age Child_Disease Accident_Trauma Surgical_Interv~ High_Fevers
##     <dbl> <dbl>         <int>           <int>            <int>       <int>
##  1 -0.330 0.690             0               1                1           0
##  2 -0.330 0.940             1               0                1           0
##  3 -0.330 0.500             1               0                0           0
##  4 -0.330 0.750             0               1                1           0
##  5 -0.330 0.670             1               1                0           0
##  6 -0.330 0.670             1               0                1           0
##  7 -0.330 0.670             0               0                0          -1
##  8 -0.330 1.00              1               1                1           0
##  9  1.00  0.640             0               0                1           0
## 10  1.00  0.610             1               0                0           0
## # ... with 90 more rows, and 4 more variables: Alcohol_Freq <dbl>,
## #   Smoking <int>, Hours_Sitting <dbl>, Diagnosis <int>
# Number of participants with Surgical_Intervention and Diagnosis
fertility %>%
    summarise_at(vars(Surgical_Intervention, Diagnosis), sum)
## # A tibble: 1 x 2
##   Surgical_Intervention Diagnosis
##                   <int>     <int>
## 1                    51        12
# Mean and Standard Deviation of Age
fertility %>%
    summarise_at(vars(Age), funs(mean, sd))
## # A tibble: 1 x 2
##    mean    sd
##   <dbl> <dbl>
## 1 0.669 0.121
# Counts of the Groups in High_Fevers
fertility %>%
    count(High_Fevers)
## # A tibble: 3 x 2
##   High_Fevers     n
##         <int> <int>
## 1          -1     9
## 2           0    63
## 3           1    28
# Counts of the Groups in Child_Disease
fertility %>%
    count(Child_Disease, Accident_Trauma)
## # A tibble: 4 x 3
##   Child_Disease Accident_Trauma     n
##           <int>           <int> <int>
## 1             0               0    10
## 2             0               1     3
## 3             1               0    46
## 4             1               1    41
# Find proportions
fertility %>%
    summarise_at(vars(Accident_Trauma, Surgical_Intervention), mean)
## # A tibble: 1 x 2
##   Accident_Trauma Surgical_Intervention
##             <dbl>                 <dbl>
## 1           0.440                 0.510
# Set seed
set.seed(42)

# Generate Synthetic data
accident <- rbinom(100, 1, prob=0.440)
surgical <- rbinom(100, 1, prob=0.510)


# Square root Transformation of Salary
whitehouse.salary <- whitehouse %>%
    mutate(Salary = sqrt(Salary))

# Calculate the mean and standard deviation
stats <- whitehouse.salary %>%
    summarize(mean(Salary), sd(Salary))

stats
## # A tibble: 1 x 2
##   `mean(Salary)` `sd(Salary)`
##            <dbl>        <dbl>
## 1            279         71.8
# Generate Synthetic data
set.seed(42)
salary_transformed <- rnorm(nrow(whitehouse), mean=279, sd=71.8)

# Power transformation
salary_original <- salary_transformed ** 2

# Hard bound
salary <- ifelse(salary_original < 0, 0, salary_original)

Chapter 2 - Introduction to Differential Privacy

Differential Privacy - quantification of privacy loss via a privacy budget:

  • The worst-case scenario is that no assumptions are made about data intruders
    • If an individual is from a small group, their data may be 100% available by looking at statistics in aggregate and statistics for the group that excludes them (everyone else)
  • The privacy budget is defined using epsilon - smaller numbers mean that less information will be made available
  • The general concept is to compare the query answer on a dataset that includes an individual with the answer on the same dataset excluding that individual
    • The answer sent back to the query will have noise added to it depending on the privacy budget
  • Basically, the differential privacy algorithm finds the most “unique” person in the dataset, and then decides how much noise to add based on how identifiable they are by attribute

Global Sensitivity - usual decision-making factor for differential privacy:

  • The global sensitivity of a query is the most the answer could change from removing one individual
    • By definition, count queries always have a global sensitivity of 1 (removing one individual changes a count by at most 1)
    • Therefore, proportion queries always have a global sensitivity of 1/n
    • Mean queries always have a global sensitivity of (max - min) / n
    • Variance queries always have a global sensitivity of (max - min)^2 / n
  • The global sensitivity and the epsilon work together to determine the amount of noise
    • Measures like the median are not very sensitive to outliers, and thus very little noise needs to be added
    • Measures like the maximum are very sensitive to outliers (e.g., Bill Gates' income), and thus a large amount of noise needs to be added

Laplace Mechanism - adds noise drawn from the Laplace distribution with mean 0 and scale equal to the global sensitivity divided by the privacy budget:

  • fertility %>% summarise_at(vars(Child_Disease), sum)
  • library(smoothmest) # has function rdoublex(draws, mean, shaping) - set draws=1, mean=true_mean, shaping=globalSensitivity / epsilon
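
If smoothmest is not installed, an equivalent double-exponential draw can be sketched in base R with inverse-transform sampling (this rlaplace helper is an assumed stand-in for rdoublex, not part of the course code):

```r
set.seed(42)

# inverse-transform sampler for Laplace(mu, b):
# if U ~ Uniform(-1/2, 1/2), then mu - b * sign(U) * log(1 - 2|U|) ~ Laplace(mu, b)
rlaplace <- function(n, mu = 0, b = 1) {
  u <- runif(n, min = -0.5, max = 0.5)
  mu - b * sign(u) * log(1 - 2 * abs(u))
}

# Laplace mechanism for a count query: true count 51,
# global sensitivity 1, privacy budget eps = 0.1
eps <- 0.1
noisy_count <- rlaplace(1, mu = 51, b = 1 / eps)
```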

Example code includes:

# Number of observations
n <- nrow(fertility)

# Global sensitivity of counts
gs.count <- 1

# Global sensitivity of proportions
gs.prop <- 1/n


# Lower bound of Hours_Sitting
a <- 0

# Upper bound of Hours_Sitting
b <- 1

# Global sensitivity of mean for Hours_Sitting
gs.mean <- (b - a) / n

# Global sensitivity of proportions Hours_Sitting
gs.var <- (b - a)**2 / n


# How many participants had a Surgical_Intervention?
fertility %>%
   summarise_at(vars(Surgical_Intervention), sum)
## # A tibble: 1 x 1
##   Surgical_Intervention
##                   <int>
## 1                    51
# Set the seed
set.seed(42)

# Apply the Laplace mechanism
eps <- 0.1
smoothmest::rdoublex(1, 51, 1/eps)
## [1] 52.98337
# Proportion of Accident_Trauma
stats <- fertility %>%
   summarise_at(vars(Accident_Trauma), mean)

stats
## # A tibble: 1 x 1
##   Accident_Trauma
##             <dbl>
## 1           0.440
# Set the seed
set.seed(42)

# Apply the Laplace mechanism
eps <- 0.1
smoothmest::rdoublex(1, 0.440, (1/n)/eps)
## [1] 0.4598337
# Mean and Variance of Hours Sitting
fertility %>%
    summarise_at(vars(Hours_Sitting), funs(mean, var))
## # A tibble: 1 x 2
##    mean    var
##   <dbl>  <dbl>
## 1 0.407 0.0347
# Setup
set.seed(42)
eps <- 0.1

# Laplace mechanism to mean
smoothmest::rdoublex(1, 0.41, gs.mean/eps)
## [1] 0.4298337
# Laplace mechanism to variance
smoothmest::rdoublex(1, 0.03, gs.var/eps)
## [1] 0.0583491

Chapter 3 - Differentially Private Properties

Sequential Composition - method to ensure that someone cannot recover the true answer simply by sending multiple queries:

  • Idea is that the privacy budget is divided by the number of queries you plan to send
  • For example, if a query will be made for mean and another query will be made for maximum, then epsilon needs to be divided by two

Parallel Composition - method to account for queries to different parts of the database (no adjustment to epsilon needed):

  • The deciding factor between sequential and parallel composition is whether the queries can be answered using completely disjoint (MECE) splits of the dataset

Post-processing:

  • When a new query can be answered using data that has already been privatized, the existing noisy answers can be combined into a noisy answer to the new query
    • The privacy budget need not be adjusted in this case
    • For example, if there are three groups, can just add noise to two of the groups and let the third group be total minus these two groups

Impossible and inconsistent answers:

  • Bounds can be imposed, such as clamping negative values to zero or capping anything greater than the dataset total at that total
    • rdoublex(1, 12, gs.count / eps) %>% round() %>% max(0) # lower bound is zero
    • normalized <- (smoking/sum(smoking)) * (nrow(fertility)) # upper bound is the size of the dataset

Example code includes:

# Set Value of Epsilon
eps <- 0.1 / 2

# Number of observations
n <- nrow(fertility)

# Lower bound of Age
a <- 0

# Upper bound of Age
b <- 1

# GS of counts for Diagnosis
gs.count <- 1

# GS of mean for Age
gs.mean <- (b-a)/n


# Number of Participants with abnormal diagnosis
stats1 <- fertility %>% 
    summarize_at(vars(Diagnosis), sum)

stats1
## # A tibble: 1 x 1
##   Diagnosis
##       <int>
## 1        12
# Mean of age
stats2 <- fertility %>%
    summarize_at(vars(Age), mean)

stats2
## # A tibble: 1 x 1
##     Age
##   <dbl>
## 1 0.669
# Set seed
set.seed(42)

# Laplace mechanism to the count of abnormal diagnosis
smoothmest::rdoublex(1, 12, gs.count/eps)
## [1] 15.96674
# Laplace mechanism to the mean of age
smoothmest::rdoublex(1, 0.67, gs.mean/eps)
## [1] 0.7266982
# Set Value of Epsilon
eps <- 0.1

# Mean of Age per diagnosis level 
fertility %>%
  group_by(Diagnosis) %>%
  summarise_at(vars(Age), mean)
## # A tibble: 2 x 2
##   Diagnosis   Age
##       <int> <dbl>
## 1         0 0.664
## 2         1 0.707
# Set the seed
set.seed(42)

# Laplace mechanism to the mean age of participants with an abnormal diagnoisis
smoothmest::rdoublex(1, 0.71, gs.mean/eps)
## [1] 0.7298337
# Laplace mechanism to the mean age of participants with a normal diagnoisis
smoothmest::rdoublex(1, 0.66, gs.mean/eps)
## [1] 0.6883491
# Set Value of Epsilon
eps <- 0.5/3

# GS of Counts
gs.count <- 1

# Mean of Age per diagnosis level (the season counts of 28, 37, and 4 are used in the Laplace calls below)
fertility %>%
    group_by(Diagnosis) %>%
    summarise_at(vars(Age), mean)
## # A tibble: 2 x 2
##   Diagnosis   Age
##       <int> <dbl>
## 1         0 0.664
## 2         1 0.707
# Set the seed
set.seed(42)

# Laplace mechanism to the number of participants who were evaluated in the winter, spring, and summer
winter <- smoothmest::rdoublex(1, 28, gs.count / eps) %>%
    round()

spring <- smoothmest::rdoublex(1, 37, gs.count / eps) %>%
    round()

summer <- smoothmest::rdoublex(1, 4, gs.count / eps) %>%
    round()

# Post-process based on previous queries
fall <- nrow(fertility) - winter - spring - summer


# Set Value of Epsilon
eps <- 0.01

# GS of counts
gs.count <- 1

# Number of Participants with Child_Disease
fertility %>%
    summarise_at(vars(Child_Disease), sum)
## # A tibble: 1 x 1
##   Child_Disease
##           <int>
## 1            87
# Apply the Laplace mechanism
set.seed(42)
lap_childhood <- smoothmest::rdoublex(1, 87, gs.count / eps) %>%
    round()

# Total number of observations in fertility
max_value <- nrow(fertility)

# Bound the value such that the noisy answer does not exceed the total number of observations
ifelse(lap_childhood > max_value, max_value, lap_childhood)
## [1] 100
# Set the seed
set.seed(42)

# Apply the Laplace mechanism
fever1 <- smoothmest::rdoublex(1, 9, gs.count/eps) %>%
    max(0)
fever2 <- smoothmest::rdoublex(1, 63, gs.count/eps) %>%
    max(0)
fever3 <- smoothmest::rdoublex(1, 28, gs.count/eps) %>%
    max(0)

fever <- c(fever1, fever2, fever3)

# Normalize noise 
fever_normalized <- (fever/sum(fever)) * (nrow(fertility))

# Round the values
round(fever_normalized)
## [1] 24 76  0

Chapter 4 - Differentially Private Data Synthesis

Laplace Sanitizer - basic way to generate “noisy” categorical data:

  • Takes advantage of parallel composition - if the data can be binned or placed in a contingency table, there is no need to further divide the privacy budget
    • Since the data is queried as a histogram, the bins are disjoint (non-overlapping), so parallel composition applies
  • Can generate data using rep() for a single vector

Parametric Approaches:

  • Sampling from a binomial distribution (where appropriate), with a known proportion that has been perturbed by the Laplace mechanism to satisfy the differential privacy guarantee
  • Sampling from a normal or log-normal distribution (where appropriate), with a known mean and variance that have been perturbed by the Laplace mechanism to satisfy the differential privacy guarantee

Wrap up:

  • Basics of anonymizing data, such as removing names
  • Basics of modifying data such as generalizing to categorical data
  • Basics of generating synthetic data using rbinom() and rnorm()
  • Basics of privacy budgets, global sensitivities, and the Laplace mechanism
  • Basics of differential privacy, such as sequential composition (split epsilon) or parallel composition (including through binning or contingency tables)
  • Basics of the Laplace sanitizer for both categorical data (rbinom) and continuous data (rnorm)
  • Next steps include managing data gaps, incorrect statistics distributions with hard bounding, etc.
    • Local differential privacy (Apple) and probabilistic differential privacy (US census)
    • Techniques specific to GPS data or PCA

Example code includes:

# Set Value of Epsilon
eps <- 0.1

# GS of Counts
gs.count <- 1

# Number of participants in each season
fertility %>%
    count(Season)
## # A tibble: 4 x 2
##   Season     n
##    <dbl> <int>
## 1 -1.00     28
## 2 -0.330    37
## 3  0.330     4
## 4  1.00     31
# Set the seed
set.seed(42)

# Apply the Laplace mechanism 
winter <- smoothmest::rdoublex(1, 28, gs.count/eps) %>% max(0)
spring <- smoothmest::rdoublex(1, 37, gs.count/eps) %>% max(0)
summer <- smoothmest::rdoublex(1, 4, gs.count/eps) %>% max(0)
fall <- smoothmest::rdoublex(1, 31, gs.count/eps) %>% max(0)


# Store noisy results
seasons <- c(winter = winter, spring = spring, summer = summer, fall = fall)

# Normalizing seasons
seasons_normalized <- (seasons/sum(seasons)) * nrow(fertility)

# Round the values
round(seasons_normalized)
## winter spring summer   fall 
##     29     38      0     33
# Generate synthetic data for winter
rep(-1, 29)
##  [1] -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
## [24] -1 -1 -1 -1 -1 -1
# Generate synthetic data for spring
rep(-0.33, 38)
##  [1] -0.33 -0.33 -0.33 -0.33 -0.33 -0.33 -0.33 -0.33 -0.33 -0.33 -0.33
## [12] -0.33 -0.33 -0.33 -0.33 -0.33 -0.33 -0.33 -0.33 -0.33 -0.33 -0.33
## [23] -0.33 -0.33 -0.33 -0.33 -0.33 -0.33 -0.33 -0.33 -0.33 -0.33 -0.33
## [34] -0.33 -0.33 -0.33 -0.33 -0.33
# Generate synthetic data for summer
rep(0.33, 0)
## numeric(0)
# Generate synthetic data for fall
rep(1, 33)
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# Calculate proportions
fertility %>%
    summarise_at(vars(Accident_Trauma, Surgical_Intervention), mean)
## # A tibble: 1 x 2
##   Accident_Trauma Surgical_Intervention
##             <dbl>                 <dbl>
## 1           0.440                 0.510
# Number of Observations
n <- nrow(fertility)

# Set Value of Epsilon
eps <- 0.1

# GS of Proportion
gs.prop <- (1/n)


# Apply the Laplace mechanism
set.seed(42)
smoothmest::rdoublex(1, 0.44, gs.prop/eps)
## [1] 0.4598337
smoothmest::rdoublex(1, 0.51, gs.prop/eps)
## [1] 0.5383491
# Generate Synthetic data
set.seed(42)
accident <- rbinom(n, 1, 0.46)
surgical <- rbinom(n, 1, 0.54)


# Set Value of Epsilon
eps <- 0.1 / 2

# Number of observations
n <- nrow(fertility)

# Upper and lower bounds of age
a <- 0
b <- 1

# GS of mean and variance for age
gs.mean <- (b-a) / n
gs.var <- (b-a)**2 / n


# Mean and Variance of Age
fertility %>%
    summarise_at(vars(Age), funs(mean, var))
## # A tibble: 1 x 2
##    mean    var
##   <dbl>  <dbl>
## 1 0.669 0.0147
# Apply the Laplace mechanism
set.seed(42)
smoothmest::rdoublex(1, 0.67, gs.mean/eps)
## [1] 0.7096674
smoothmest::rdoublex(1, 0.01, gs.var/eps)
## [1] 0.06669821
# Generate Synthetic data
set.seed(42)
age <- rnorm(n, mean=0.71, sd=sqrt(0.07))

# Hard Bounding the data
age[age < 0] <- 0
age[age > 1] <- 1

Marketing Analytics in R: Statistical Modeling

Chapter 1 - Modeling Customer Lifetime Value with Linear Regression

Introduction - Verena from INWT Statistics (consultancy in marketing analytics):

  • Customer Lifetime Value (CLV) is the expected value a customer will bring to the company over time
    • CLV is based on margin, and needs to use current information to predict future margins
    • Customers predicted to have higher CLV can then be targeted
  • Can inspect the data without seeing attributes using str(clvData1, give.attr = FALSE)
  • Can derive correlations using corrplot
    • library(corrplot)
    • clvData1 %>% select(nOrders, nItems, …, margin, futureMargin) %>% cor() %>% corrplot()

Simple linear regression - one predictor variable to predict one response variable:

  • Can run linear regressions using basic stats modules
    • simpleLM <- lm(futureMargin ~ margin, data = clvData1)
    • summary(simpleLM)
  • Can plot previous margin vs. current margin, including a linear regression (smooth)
    • ggplot(clvData1, aes(margin, futureMargin)) + geom_point() + geom_smooth(method = lm, se = FALSE) + xlab("Margin year 1") + ylab("Margin year 2")
  • Several conditions must apply for linear regression to be the best method
    • Linear relationship between x and y
    • No measurement error in x (weak exogeneity)
    • Independence of errors
    • Expectation of errors is 0
    • Constant variance of prediction errors (homoscedasticity)
    • Normality of errors
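
These conditions can be checked visually with base R's built-in diagnostic plots; a minimal sketch, using the built-in mtcars dataset as a stand-in (not course data):

```r
# Sketch: standard diagnostic plots for a fitted lm (mtcars is only a
# placeholder dataset, not from the course)
fit <- lm(mpg ~ wt, data = mtcars)

par(mfrow = c(2, 2))
plot(fit)  # residuals vs fitted (linearity, homoscedasticity),
           # normal Q-Q (normality of errors), scale-location, leverage
```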

Multiple linear regression:

  • Omitted variable bias is when a variable not in the regression is correlated with both the predictor and the response variables
    • Simpson’s Paradox is an example - upward sloping becomes downward sloping after properly splitting on the extra variable
  • Multicollinearity is a threat to a linear regression - it leads to unstable regression coefficients with inflated standard errors
    • car::vif(myLMModel) # above 5 is concerning, above 10 almost always needs to be addressed

Model validation, fit, and prediction:

  • The R-squared is the proportion of variance in the dependent variable that is explained by the regression
  • Can look at the p-value of the F-test to assess the overall statistical significance of the model
  • There is a risk of over-fitting, when the model is overly complex and learns artifacts of the training data rather than genuine patterns
    • Can use stats::AIC() or MASS::stepAIC(), with the goal of minimizing AIC (models must be fit to the same data to be comparable)
    • AIC(multipleLM2)
  • Can predict outputs automatically, such as with
    • predMargin <- predict(multipleLM2, newdata = clvData2)

Example code includes:

salesData <- readr::read_csv("./RInputFiles/salesData.csv")
## Parsed with column specification:
## cols(
##   id = col_integer(),
##   nItems = col_integer(),
##   mostFreqStore = col_character(),
##   mostFreqCat = col_character(),
##   nCats = col_integer(),
##   preferredBrand = col_character(),
##   nBrands = col_integer(),
##   nPurch = col_integer(),
##   salesLast3Mon = col_double(),
##   salesThisMon = col_double(),
##   daysSinceLastPurch = col_integer(),
##   meanItemPrice = col_double(),
##   meanShoppingCartValue = col_double(),
##   customerDuration = col_integer()
## )
# Structure of dataset
str(salesData, give.attr = FALSE)
## Classes 'tbl_df', 'tbl' and 'data.frame':    5122 obs. of  14 variables:
##  $ id                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ nItems               : int  1469 1463 262 293 108 216 174 122 204 308 ...
##  $ mostFreqStore        : chr  "Stockton" "Stockton" "Colorado Springs" "Colorado Springs" ...
##  $ mostFreqCat          : chr  "Alcohol" "Alcohol" "Shoes" "Bakery" ...
##  $ nCats                : int  72 73 55 50 32 41 36 31 41 52 ...
##  $ preferredBrand       : chr  "Veina" "Veina" "Bo" "Veina" ...
##  $ nBrands              : int  517 482 126 108 79 98 78 62 99 103 ...
##  $ nPurch               : int  82 88 56 43 18 35 34 12 26 33 ...
##  $ salesLast3Mon        : num  2742 2791 1530 1766 1180 ...
##  $ salesThisMon         : num  1284 1243 683 730 553 ...
##  $ daysSinceLastPurch   : int  1 1 1 1 12 2 2 4 14 1 ...
##  $ meanItemPrice        : num  1.87 1.91 5.84 6.03 10.93 ...
##  $ meanShoppingCartValue: num  33.4 31.7 27.3 41.1 65.6 ...
##  $ customerDuration     : int  821 657 548 596 603 673 612 517 709 480 ...
# Visualization of correlations
salesData %>% select_if(is.numeric) %>%
  select(-id) %>%
  cor() %>%
  corrplot::corrplot()

# Frequent stores
ggplot(salesData) +
    geom_boxplot(aes(x = mostFreqStore, y = salesThisMon))

# Preferred brand
ggplot(salesData) +
    geom_boxplot(aes(x = preferredBrand, y = salesThisMon))

# Model specification using lm
salesSimpleModel <- lm(salesThisMon ~ salesLast3Mon, data = salesData)

# Looking at model summary
summary(salesSimpleModel)
## 
## Call:
## lm(formula = salesThisMon ~ salesLast3Mon, data = salesData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -570.18  -68.26    3.21   72.98  605.58 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   99.690501   6.083886   16.39   <2e-16 ***
## salesLast3Mon  0.382696   0.004429   86.40   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 117.5 on 5120 degrees of freedom
## Multiple R-squared:  0.5932, Adjusted R-squared:  0.5931 
## F-statistic:  7465 on 1 and 5120 DF,  p-value: < 2.2e-16
# Estimating the full model
salesModel1 <- lm(salesThisMon ~ . -id, data = salesData)

# Checking variance inflation factors
car::vif(salesModel1)
##                            GVIF Df GVIF^(1/(2*Df))
## nItems                11.772600  1        3.431122
## mostFreqStore          1.260469  9        1.012943
## mostFreqCat            1.527348  9        1.023809
## nCats                  8.402073  1        2.898633
## preferredBrand         1.682184  9        1.029316
## nBrands               14.150868  1        3.761764
## nPurch                 3.083952  1        1.756119
## salesLast3Mon          8.697663  1        2.949180
## daysSinceLastPurch     1.585057  1        1.258991
## meanItemPrice          1.987665  1        1.409846
## meanShoppingCartValue  2.247579  1        1.499193
## customerDuration       1.004664  1        1.002329
# Estimating new model by removing information on brand
salesModel2 <- lm(salesThisMon ~ . -id -preferredBrand -nBrands, data = salesData)

# Checking variance inflation factors
car::vif(salesModel2)
##                           GVIF Df GVIF^(1/(2*Df))
## nItems                6.987456  1        2.643380
## mostFreqStore         1.178251  9        1.009154
## mostFreqCat           1.269636  9        1.013351
## nCats                 5.813494  1        2.411119
## nPurch                3.069046  1        1.751869
## salesLast3Mon         8.412520  1        2.900435
## daysSinceLastPurch    1.579426  1        1.256752
## meanItemPrice         1.925494  1        1.387622
## meanShoppingCartValue 2.238410  1        1.496132
## customerDuration      1.002981  1        1.001489
salesData2_4 <- readr::read_csv("./RInputFiles/salesDataMon2To4.csv")
## Parsed with column specification:
## cols(
##   id = col_integer(),
##   nItems = col_integer(),
##   mostFreqStore = col_character(),
##   mostFreqCat = col_character(),
##   nCats = col_integer(),
##   preferredBrand = col_character(),
##   nBrands = col_integer(),
##   nPurch = col_integer(),
##   salesLast3Mon = col_double(),
##   daysSinceLastPurch = col_integer(),
##   meanItemPrice = col_double(),
##   meanShoppingCartValue = col_double(),
##   customerDuration = col_integer()
## )
# getting an overview of new data
summary(salesData2_4)
##        id           nItems       mostFreqStore      mostFreqCat       
##  Min.   :   1   Min.   :   1.0   Length:5173        Length:5173       
##  1st Qu.:1372   1st Qu.:  84.0   Class :character   Class :character  
##  Median :2733   Median : 155.0   Mode  :character   Mode  :character  
##  Mean   :2729   Mean   : 185.9                                        
##  3rd Qu.:4085   3rd Qu.: 257.0                                        
##  Max.   :5455   Max.   :1461.0                                        
##      nCats       preferredBrand        nBrands           nPurch     
##  Min.   : 1.00   Length:5173        Min.   :  1.00   Min.   : 1.00  
##  1st Qu.:27.00   Class :character   1st Qu.: 45.00   1st Qu.:11.00  
##  Median :37.00   Mode  :character   Median : 75.00   Median :17.00  
##  Mean   :36.23                      Mean   : 81.66   Mean   :20.02  
##  3rd Qu.:46.00                      3rd Qu.:110.00   3rd Qu.:27.00  
##  Max.   :74.00                      Max.   :484.00   Max.   :86.00  
##  salesLast3Mon  daysSinceLastPurch meanItemPrice     meanShoppingCartValue
##  Min.   : 189   Min.   : 1.000     Min.   :  1.879   Min.   :  17.58      
##  1st Qu.:1068   1st Qu.: 2.000     1st Qu.:  6.049   1st Qu.:  53.88      
##  Median :1331   Median : 4.000     Median :  8.556   Median :  75.77      
##  Mean   :1324   Mean   : 6.589     Mean   : 12.116   Mean   :  91.88      
##  3rd Qu.:1570   3rd Qu.: 7.000     3rd Qu.: 12.969   3rd Qu.: 109.74      
##  Max.   :2745   Max.   :87.000     Max.   :313.050   Max.   :1147.66      
##  customerDuration
##  Min.   :  31.0  
##  1st Qu.: 580.0  
##  Median : 682.0  
##  Mean   : 676.8  
##  3rd Qu.: 777.0  
##  Max.   :1386.0
# predicting sales
predSales5 <- predict(salesModel2, newdata = salesData2_4)

# calculating mean of future sales
mean(predSales5)
## [1] 625.1438

Chapter 2 - Logistic Regression for Churn Prevention

Churn prevention in online marketing:

  • Objective is to predict the likelihood that a customer will return (i.e., repeat their business), assessed using logistic regression
    • Model the log-odds (defined as log (P(Y=1) / P(Y=0))) as a linear function of the inputs
    • Convert the log-odds to odds (defined as P(Y=1) / P(Y=0)) by exponentiation
    • Convert the odds to a probability of churning, using odds / (1 + odds)
  • Can begin with basic data exploration
    • ggplot(churnData, aes(x = returnCustomer)) + geom_histogram(stat = "count")
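
The log-odds to odds to probability chain above can be written out directly; a minimal sketch with an illustrative log-odds value (not taken from the course data):

```r
# Sketch: converting a log-odds value to odds and then to a probability
log_odds <- 0.4               # illustrative value of the linear predictor
odds <- exp(log_odds)         # odds = P(Y=1) / P(Y=0)
p <- odds / (1 + odds)        # probability of the event

# plogis() performs the same log-odds -> probability conversion in one step
all.equal(p, plogis(log_odds))
```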

Modeling and model selection:

  • The logit model can be run using the GLM provided in R
    • logitModelFull <- glm(returnCustomer ~ title + newsletter + websiteDesign + …, family = binomial, churnData)
  • Interpreting the coefficients is not easy - they are related to the log-odds
    • Can exponentiate the coefficients to get their impact on the odds
    • Can then interpret that greater than 1 means “more likely, all else equal”
  • Can use MASS::stepAIC() to help refine the modeling
    • library(MASS)
    • logitModelNew <- stepAIC(logitModelFull, trace = 0)
    • summary(logitModelNew)
    • Produces a model with fewer variables and a lower AIC

In-sample model fit and thresholding:

  • There are three common pseudo-R-squared statistics for the results of logistic regression
    • McFadden: R-squared = 1 - log L(full) / log L(null)
    • Cox-Snell: R-squared = 1 - (L(null) / L(full)) ** (2/n)
    • Nagelkerke: R-squared = [1 - (L(null) / L(full)) ** (2/n)] / [1 - L(null) ** (2/n)]
    • Generally, anything above 0.2 is reasonably good
    • descr::LogRegR2(logitModelNew)
    • library(SDMTools)
    • churnData$predNew <- predict(logitModelNew, type = "response", na.action = na.exclude) # get the prediction probabilities
    • data %>% select(returnCustomer, predNew) %>% tail()
    • confMatrixNew <- confusion.matrix(churnData$returnCustomer, churnData$predNew, threshold = 0.5) # this is the version from SDMTools
  • Can give different weights to the different errors (false negatives, false positives, etc.)
    • Can instead look at a payoff, defined based on scalars for the various quadrants
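
The pseudo-R-squared statistics can also be computed by hand from the model log-likelihoods; a sketch of McFadden's version on simulated data (the data and coefficients here are illustrative, not the course's churnData):

```r
# Sketch: McFadden's pseudo-R-squared = 1 - log L(full) / log L(null),
# computed on a small simulated binary-outcome dataset
set.seed(42)
d <- data.frame(x = rnorm(500))
d$y <- rbinom(500, 1, plogis(-0.5 + 1.2 * d$x))

full <- glm(y ~ x, family = binomial, data = d)
null <- glm(y ~ 1, family = binomial, data = d)

mcfadden <- 1 - as.numeric(logLik(full)) / as.numeric(logLik(null))
mcfadden
```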

Out-of-sample validation and cross validation:

  • Begin by randomly splitting the data into a training set (roughly two-thirds) and a holdout validation set (the remaining third)
    • set.seed(534381)
    • churnData$isTrain <- rbinom(nrow(churnData), 1, 0.66)
    • train <- subset(churnData, churnData$isTrain == 1)
    • test <- subset(churnData, churnData$isTrain == 0)
    • test$predNew <- predict(logitTrainNew, type = "response", newdata = test) # make predictions only on the test dataset
  • Cross-validation is an even more powerful tool for assessing out-of-sample error
    • Split the data into k subsets, and run the model k times, each time training on k-1 subsets and validating on the held-out subset
    • Acc03 <- function(r, pi = 0) {
    •       cm <- confusion.matrix(r, pi, threshold = 0.3)
    •       acc <- sum(diag(cm)) / sum(cm)
    •       return(acc)
    • }
    • set.seed(534381)
    • boot::cv.glm(churnData, logitModelNew, cost = Acc03, K = 6)$delta
  • Can continually tweak the model to see if transforms, variable additions, etc., might tend to improve the out-of-sample error rate
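
The cross-validation workflow above can be sketched end-to-end with boot::cv.glm; here simulated data and a plain accuracy cost at a 0.5 threshold stand in for the course's churnData and its SDMTools-based Acc03:

```r
# Sketch: k-fold cross-validated classification accuracy via boot::cv.glm
library(boot)

set.seed(534381)
d <- data.frame(x = rnorm(300))
d$y <- rbinom(300, 1, plogis(0.8 * d$x))

fit <- glm(y ~ x, family = binomial, data = d)

# cost(observed, predicted): share of cases classified correctly at 0.5
acc <- function(r, pi = 0) mean((pi > 0.5) == r)

cv_acc <- cv.glm(d, fit, cost = acc, K = 6)$delta[1]
cv_acc
```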

Example code includes:

defaultData <- readr::read_delim("./RInputFiles/defaultData.csv", delim=";")
## Parsed with column specification:
## cols(
##   .default = col_integer()
## )
## See spec(...) for full column specifications.
# Summary of data
summary(defaultData)
##        ID           limitBal            sex          education    
##  Min.   :    1   Min.   :  10000   Min.   :1.000   Min.   :0.000  
##  1st Qu.: 4501   1st Qu.:  50000   1st Qu.:1.000   1st Qu.:1.000  
##  Median : 9000   Median : 130000   Median :2.000   Median :2.000  
##  Mean   : 9000   Mean   : 162902   Mean   :1.588   Mean   :1.835  
##  3rd Qu.:13500   3rd Qu.: 230000   3rd Qu.:2.000   3rd Qu.:2.000  
##  Max.   :18000   Max.   :1000000   Max.   :2.000   Max.   :6.000  
##     marriage         age             pay1               pay2        
##  Min.   :0.00   Min.   :21.00   Min.   :-2.00000   Min.   :-2.0000  
##  1st Qu.:1.00   1st Qu.:28.00   1st Qu.:-1.00000   1st Qu.:-1.0000  
##  Median :2.00   Median :34.00   Median : 0.00000   Median : 0.0000  
##  Mean   :1.56   Mean   :35.48   Mean   : 0.02783   Mean   :-0.1017  
##  3rd Qu.:2.00   3rd Qu.:41.00   3rd Qu.: 0.00000   3rd Qu.: 0.0000  
##  Max.   :3.00   Max.   :75.00   Max.   : 8.00000   Max.   : 8.0000  
##       pay3              pay4              pay5             pay6        
##  Min.   :-2.0000   Min.   :-2.0000   Min.   :-2.000   Min.   :-2.0000  
##  1st Qu.:-1.0000   1st Qu.:-1.0000   1st Qu.:-1.000   1st Qu.:-1.0000  
##  Median : 0.0000   Median : 0.0000   Median : 0.000   Median : 0.0000  
##  Mean   :-0.1294   Mean   :-0.1974   Mean   :-0.228   Mean   :-0.2567  
##  3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.000   3rd Qu.: 0.0000  
##  Max.   : 8.0000   Max.   : 8.0000   Max.   : 8.000   Max.   : 8.0000  
##     billAmt1          billAmt2         billAmt3          billAmt4      
##  Min.   :-165580   Min.   :-33350   Min.   : -34041   Min.   :-170000  
##  1st Qu.:   3675   1st Qu.:  3149   1st Qu.:   2655   1st Qu.:   2245  
##  Median :  22450   Median : 21425   Median :  20035   Median :  18703  
##  Mean   :  50030   Mean   : 48131   Mean   :  45607   Mean   :  41074  
##  3rd Qu.:  65001   3rd Qu.: 62157   3rd Qu.:  58457   3rd Qu.:  50540  
##  Max.   : 964511   Max.   :983931   Max.   :1664089   Max.   : 891586  
##     billAmt5         billAmt6          payAmt1          payAmt2       
##  Min.   :-37594   Min.   :-339603   Min.   :     0   Min.   :      0  
##  1st Qu.:  1684   1st Qu.:   1150   1st Qu.:   949   1st Qu.:    696  
##  Median : 18046   Median :  16780   Median :  2087   Median :   2000  
##  Mean   : 39398   Mean   :  38009   Mean   :  5532   Mean   :   5731  
##  3rd Qu.: 49355   3rd Qu.:  48442   3rd Qu.:  5000   3rd Qu.:   5000  
##  Max.   :927171   Max.   : 961664   Max.   :505000   Max.   :1684259  
##     payAmt3          payAmt4          payAmt5            payAmt6      
##  Min.   :     0   Min.   :     0   Min.   :     0.0   Min.   :     0  
##  1st Qu.:   307   1st Qu.:   228   1st Qu.:   209.8   1st Qu.:     2  
##  Median :  1500   Median :  1486   Median :  1500.0   Median :  1400  
##  Mean   :  4629   Mean   :  4757   Mean   :  4763.7   Mean   :  5135  
##  3rd Qu.:  4000   3rd Qu.:  4000   3rd Qu.:  4000.0   3rd Qu.:  4000  
##  Max.   :896040   Max.   :497000   Max.   :417990.0   Max.   :528666  
##  PaymentDefault  
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.2306  
##  3rd Qu.:0.0000  
##  Max.   :1.0000
# Look at data structure
str(defaultData, give.attr=FALSE)
## Classes 'tbl_df', 'tbl' and 'data.frame':    18000 obs. of  25 variables:
##  $ ID            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ limitBal      : int  20000 120000 90000 50000 50000 50000 500000 100000 140000 20000 ...
##  $ sex           : int  2 2 2 2 1 1 1 2 2 1 ...
##  $ education     : int  2 2 2 2 2 1 1 2 3 3 ...
##  $ marriage      : int  1 2 2 1 1 2 2 2 1 2 ...
##  $ age           : int  24 26 34 37 57 37 29 23 28 35 ...
##  $ pay1          : int  2 -1 0 0 -1 0 0 0 0 -2 ...
##  $ pay2          : int  2 2 0 0 0 0 0 -1 0 -2 ...
##  $ pay3          : int  -1 0 0 0 -1 0 0 -1 2 -2 ...
##  $ pay4          : int  -1 0 0 0 0 0 0 0 0 -2 ...
##  $ pay5          : int  -2 0 0 0 0 0 0 0 0 -1 ...
##  $ pay6          : int  -2 2 0 0 0 0 0 -1 0 -1 ...
##  $ billAmt1      : int  3913 2682 29239 46990 8617 64400 367965 11876 11285 0 ...
##  $ billAmt2      : int  3102 1725 14027 48233 5670 57069 412023 380 14096 0 ...
##  $ billAmt3      : int  689 2682 13559 49291 35835 57608 445007 601 12108 0 ...
##  $ billAmt4      : int  0 3272 14331 28314 20940 19394 542653 221 12211 0 ...
##  $ billAmt5      : int  0 3455 14948 28959 19146 19619 483003 -159 11793 13007 ...
##  $ billAmt6      : int  0 3261 15549 29547 19131 20024 473944 567 3719 13912 ...
##  $ payAmt1       : int  0 0 1518 2000 2000 2500 55000 380 3329 0 ...
##  $ payAmt2       : int  689 1000 1500 2019 36681 1815 40000 601 0 0 ...
##  $ payAmt3       : int  0 1000 1000 1200 10000 657 38000 0 432 0 ...
##  $ payAmt4       : int  0 1000 1000 1100 9000 1000 20239 581 1000 13007 ...
##  $ payAmt5       : int  0 0 1000 1069 689 1000 13750 1687 1000 1122 ...
##  $ payAmt6       : int  0 2000 5000 1000 679 800 13770 1542 1000 0 ...
##  $ PaymentDefault: int  1 1 0 0 0 0 0 0 0 0 ...
# Analyze the balancedness of dependent variable
ggplot(defaultData, aes(x = PaymentDefault)) +
  geom_histogram(stat = "count") 
## Warning: Ignoring unknown parameters: binwidth, bins, pad

# Build logistic regression model
logitModelFull <- glm(PaymentDefault ~ limitBal + sex + education + marriage +
                   age + pay1 + pay2 + pay3 + pay4 + pay5 + pay6 + billAmt1 + 
                   billAmt2 + billAmt3 + billAmt4 + billAmt5 + billAmt6 + payAmt1 + 
                   payAmt2 + payAmt3 + payAmt4 + payAmt5 + payAmt6, 
                family = "binomial", data = defaultData)

# Take a look at the model
summary(logitModelFull)
## 
## Call:
## glm(formula = PaymentDefault ~ limitBal + sex + education + marriage + 
##     age + pay1 + pay2 + pay3 + pay4 + pay5 + pay6 + billAmt1 + 
##     billAmt2 + billAmt3 + billAmt4 + billAmt5 + billAmt6 + payAmt1 + 
##     payAmt2 + payAmt3 + payAmt4 + payAmt5 + payAmt6, family = "binomial", 
##     data = defaultData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0893  -0.7116  -0.5615  -0.2794   4.2501  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.711e-01  1.505e-01  -3.795 0.000148 ***
## limitBal    -4.825e-07  1.985e-07  -2.431 0.015052 *  
## sex         -8.251e-02  3.880e-02  -2.127 0.033457 *  
## education   -1.217e-01  2.745e-02  -4.434 9.23e-06 ***
## marriage    -1.711e-01  4.016e-02  -4.259 2.05e-05 ***
## age          4.824e-03  2.257e-03   2.137 0.032570 *  
## pay1         5.743e-01  2.221e-02  25.864  < 2e-16 ***
## pay2         5.156e-02  2.552e-02   2.020 0.043336 *  
## pay3         7.811e-02  2.863e-02   2.728 0.006375 ** 
## pay4        -1.191e-02  3.285e-02  -0.363 0.716838    
## pay5         1.080e-01  3.381e-02   3.193 0.001406 ** 
## pay6        -1.956e-02  2.750e-02  -0.711 0.476852    
## billAmt1    -7.948e-06  1.582e-06  -5.023 5.09e-07 ***
## billAmt2     4.911e-06  2.006e-06   2.448 0.014350 *  
## billAmt3     4.203e-07  1.698e-06   0.247 0.804572    
## billAmt4    -1.587e-08  1.872e-06  -0.008 0.993234    
## billAmt5     9.703e-07  2.154e-06   0.451 0.652293    
## billAmt6     6.758e-07  1.591e-06   0.425 0.670955    
## payAmt1     -1.878e-05  3.252e-06  -5.777 7.61e-09 ***
## payAmt2     -6.406e-06  2.364e-06  -2.710 0.006731 ** 
## payAmt3     -3.325e-06  2.401e-06  -1.385 0.166153    
## payAmt4     -3.922e-06  2.342e-06  -1.675 0.093970 .  
## payAmt5     -2.383e-06  2.168e-06  -1.099 0.271635    
## payAmt6     -1.916e-06  1.618e-06  -1.184 0.236521    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 19438  on 17999  degrees of freedom
## Residual deviance: 17216  on 17976  degrees of freedom
## AIC: 17264
## 
## Number of Fisher Scoring iterations: 5
# Take a look at the odds
coefsexp <- coef(logitModelFull) %>% exp() %>% round(2)
coefsexp
## (Intercept)    limitBal         sex   education    marriage         age 
##        0.56        1.00        0.92        0.89        0.84        1.00 
##        pay1        pay2        pay3        pay4        pay5        pay6 
##        1.78        1.05        1.08        0.99        1.11        0.98 
##    billAmt1    billAmt2    billAmt3    billAmt4    billAmt5    billAmt6 
##        1.00        1.00        1.00        1.00        1.00        1.00 
##     payAmt1     payAmt2     payAmt3     payAmt4     payAmt5     payAmt6 
##        1.00        1.00        1.00        1.00        1.00        1.00
# The old (full) model
logitModelFull <- glm(PaymentDefault ~ limitBal + sex + education + marriage +
                   age + pay1 + pay2 + pay3 + pay4 + pay5 + pay6 + billAmt1 + 
                   billAmt2 + billAmt3 + billAmt4 + billAmt5 + billAmt6 + payAmt1 + 
                   payAmt2 + payAmt3 + payAmt4 + payAmt5 + payAmt6, 
                 family = binomial, defaultData)

#Build the new model
logitModelNew <- MASS::stepAIC(logitModelFull, trace=0) 

#Look at the model
summary(logitModelNew) 
## 
## Call:
## glm(formula = PaymentDefault ~ limitBal + sex + education + marriage + 
##     age + pay1 + pay2 + pay3 + pay5 + billAmt1 + billAmt2 + billAmt5 + 
##     payAmt1 + payAmt2 + payAmt3 + payAmt4, family = binomial, 
##     data = defaultData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0839  -0.7119  -0.5611  -0.2839   4.1800  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.699e-01  1.504e-01  -3.790 0.000151 ***
## limitBal    -5.201e-07  1.954e-07  -2.661 0.007791 ** 
## sex         -8.206e-02  3.878e-02  -2.116 0.034338 *  
## education   -1.212e-01  2.744e-02  -4.418 9.96e-06 ***
## marriage    -1.724e-01  4.014e-02  -4.295 1.75e-05 ***
## age          4.863e-03  2.256e-03   2.156 0.031092 *  
## pay1         5.740e-01  2.218e-02  25.882  < 2e-16 ***
## pay2         4.979e-02  2.552e-02   1.951 0.051048 .  
## pay3         7.197e-02  2.573e-02   2.798 0.005146 ** 
## pay5         8.859e-02  2.249e-02   3.938 8.20e-05 ***
## billAmt1    -8.130e-06  1.580e-06  -5.144 2.69e-07 ***
## billAmt2     5.238e-06  1.775e-06   2.951 0.003165 ** 
## billAmt5     1.790e-06  8.782e-07   2.038 0.041554 *  
## payAmt1     -1.931e-05  3.258e-06  -5.928 3.06e-09 ***
## payAmt2     -6.572e-06  2.092e-06  -3.142 0.001681 ** 
## payAmt3     -3.693e-06  2.187e-06  -1.689 0.091241 .  
## payAmt4     -4.611e-06  2.062e-06  -2.237 0.025306 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 19438  on 17999  degrees of freedom
## Residual deviance: 17220  on 17983  degrees of freedom
## AIC: 17254
## 
## Number of Fisher Scoring iterations: 5
# Save the formula of the new model (it will be needed for the out-of-sample part) 
formulaLogit <- as.formula(summary(logitModelNew)$call)
formulaLogit
## PaymentDefault ~ limitBal + sex + education + marriage + age + 
##     pay1 + pay2 + pay3 + pay5 + billAmt1 + billAmt2 + billAmt5 + 
##     payAmt1 + payAmt2 + payAmt3 + payAmt4
# Make predictions using the full Model
defaultData$predFull <- predict(logitModelFull, type = "response", na.action = na.exclude)

# Construct the in-sample confusion matrix
confMatrixModelFull <- SDMTools::confusion.matrix(defaultData$PaymentDefault, 
                                                  defaultData$predFull, 
                                                  threshold = 0.5
                                                  )
confMatrixModelFull
##     obs
## pred     0    1
##    0 13441 3154
##    1   409  996
## attr(,"class")
## [1] "confusion.matrix"
# Calculate the accuracy for the full Model
accuracyFull <- sum(diag(confMatrixModelFull)) / sum(confMatrixModelFull)
accuracyFull
## [1] 0.8020556
# Calculate the accuracy for 'logitModelNew'
# Make prediction
defaultData$predNew <- predict(logitModelNew, type = "response", na.action = na.exclude)

# Construct the in-sample confusion matrix
confMatrixModelNew <- SDMTools::confusion.matrix(defaultData$PaymentDefault, 
                                                 defaultData$predNew, 
                                                 threshold = 0.5
                                                 )
confMatrixModelNew
##     obs
## pred     0    1
##    0 13443 3152
##    1   407  998
## attr(,"class")
## [1] "confusion.matrix"
# Calculate the accuracy...
accuracyNew <- sum(diag(confMatrixModelNew)) / sum(confMatrixModelNew)
accuracyNew
## [1] 0.8022778
# and compare it to the full model's accuracy
accuracyFull
## [1] 0.8020556
accuracyNew
## [1] 0.8022778
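
Note that SDMTools has since been archived on CRAN. A minimal base-R stand-in for the confusion-matrix-plus-accuracy pattern above (the function name and toy data here are invented for illustration, not course code):

```r
# Minimal base-R replacement for SDMTools::confusion.matrix (SDMTools is
# archived on CRAN); rows are the predicted class at the threshold, columns
# are the observed class, mirroring the matrices printed above.
confusion_base <- function(obs, pred, threshold = 0.5) {
  predClass <- as.integer(pred >= threshold)
  table(pred = factor(predClass, levels = 0:1),
        obs  = factor(obs, levels = 0:1))
}

# Toy example (not the course data)
obs  <- c(0, 0, 1, 1, 0, 1)
pred <- c(0.2, 0.6, 0.8, 0.4, 0.1, 0.9)

cm <- confusion_base(obs, pred)
accuracy <- sum(diag(cm)) / sum(cm)
accuracy  # 4 of the 6 cases fall on the diagonal, so 2/3
```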
# Prepare data frame with threshold values and empty payoff column
payoffMatrix <- data.frame(threshold = seq(from = 0.1, to = 0.5, by = 0.1), payoff = NA) 
payoffMatrix
##   threshold payoff
## 1       0.1     NA
## 2       0.2     NA
## 3       0.3     NA
## 4       0.4     NA
## 5       0.5     NA
for(i in 1:length(payoffMatrix$threshold)) {
  # Calculate confusion matrix with varying threshold
  confMatrix <- SDMTools::confusion.matrix(defaultData$PaymentDefault, 
                                           defaultData$predNew, 
                                           threshold = payoffMatrix$threshold[i]
                                           )
  # Calculate payoff and save it to the corresponding row
  payoffMatrix$payoff[i] <- confMatrix[1, 1]*250 + confMatrix[1, 2]*(-1000)
}
payoffMatrix
##   threshold payoff
## 1       0.1 306750
## 2       0.2 752750
## 3       0.3 888000
## 4       0.4 641250
## 5       0.5 208750
# Split data in train and test set
set.seed(534381) 
defaultData$isTrain <- rbinom(nrow(defaultData), 1, 0.66)
train <- subset(defaultData, isTrain == 1)
test <- subset(defaultData, isTrain  == 0)

logitTrainNew <- glm(formulaLogit, family = binomial, data = train) # Modeling
test$predNew <- predict(logitTrainNew, type = "response", newdata = test) # Predictions

# Out-of-sample confusion matrix and accuracy
confMatrixModelNew <- SDMTools::confusion.matrix(test$PaymentDefault, test$predNew, threshold = 0.3) 
sum(diag(confMatrixModelNew)) / sum(confMatrixModelNew) # Compare this value to the in-sample accuracy
## [1] 0.7797764
# Accuracy function
costAcc <- function(r, pi = 0) {
  cm <- SDMTools::confusion.matrix(r, pi, threshold = 0.3)
  acc <- sum(diag(cm)) / sum(cm)
  return(acc)
}

# Cross validated accuracy for logitModelNew
set.seed(534381)
boot::cv.glm(defaultData, logitModelNew, cost = costAcc, K = 6)$delta[1]
## [1] 0.7862778
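
The cv.glm() call above can also be written out by hand. A base-R sketch of 6-fold cross-validated accuracy on a synthetic logistic-regression problem (the data and variable names are invented for illustration):

```r
# Hand-rolled K-fold cross-validation, mirroring boot::cv.glm with the
# custom accuracy cost function above, on synthetic data.
set.seed(1)
n   <- 300
x   <- rnorm(n)
dat <- data.frame(x = x, y = rbinom(n, 1, plogis(1.5 * x)))

K    <- 6
fold <- sample(rep(1:K, length.out = n))  # random fold assignment
acc  <- numeric(K)
for (k in 1:K) {
  fit    <- glm(y ~ x, family = binomial, data = dat[fold != k, ])
  p      <- predict(fit, newdata = dat[fold == k, ], type = "response")
  acc[k] <- mean((p >= 0.5) == dat$y[fold == k])  # accuracy on held-out fold
}
mean(acc)  # cross-validated accuracy
```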

Chapter 3 - Modeling Time to Reorder with Survival Analysis

Survival Analysis Introduction:

  • Often have “censored” data, meaning that the customer journeys are not yet complete
    • Random Type I Right censoring is the most common - a point can only be observed if it has occurred before time X, and it is otherwise unknowable (but known that they have not yet churned)
    • Can plot histograms of whether someone has churned depending on the length of time
    • plotTenure <- dataSurv %>% mutate(churn = churn %>% factor(labels = c("No", "Yes"))) %>% ggplot() + geom_histogram(aes(x = tenure, fill = factor(churn))) + facet_grid( ~ churn) + theme(legend.position = "none")
  • Survival analysis attempts to estimate when something will happen (churn, second order, renewal, etc.)
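
A small simulation (assumed, not from the course) of random Type I right censoring - the event time is observed only if it occurs before the censoring time; otherwise we record the censoring time with a 0 status:

```r
set.seed(42)
n <- 1000
eventTime  <- rexp(n, rate = 0.05)  # true (latent) time to churn
censorTime <- runif(n, 0, 30)       # how long each customer has been observed

obsTime <- pmin(eventTime, censorTime)          # what we actually see
status  <- as.integer(eventTime <= censorTime)  # 1 = churned, 0 = censored
table(status)  # roughly half the journeys are still incomplete (censored)
```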

Survival curve analysis by Kaplan-Meier:

  • Begin by creating a new object containing the survival attribute
    • cbind(dataSurv %>% select(tenure, churn), surv = Surv(dataSurv$tenure, dataSurv$churn)) %>% head(10)
  • The survival function S(t) is the probability that no event has occurred by time t
    • The hazard function is the cumulative probability that the event has occurred by time t
    • The “hazard rate” is the instantaneous probability of the event happening in a small time interval, given that it has not yet happened
  • The Kaplan-Meier analysis can be used to estimate survival
    • fitKM <- survival::survfit(Surv(dataSurv$tenure, dataSurv$churn) ~ 1, type = "kaplan-meier")
    • print(fitKM) # gives a few rough summary statistics
    • plot(fitKM) # survival curve with confidence interval
    • fitKMstr <- survfit(Surv(tenure, churn) ~ Partner, data = dataSurv) # add covariates, such as ~ Partner rather than ~1 as in the baseline
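
What survfit() computes can be reproduced by hand on a toy example (the data below are invented): at each event time the survival estimate shrinks by (1 - d/n), where d is the number of events at that time and n the number still at risk:

```r
time   <- c(2, 3, 3, 5, 7, 8)  # toy follow-up times
status <- c(1, 1, 0, 1, 0, 1)  # 1 = event (churn), 0 = censored

eventTimes <- sort(unique(time[status == 1]))
surv <- 1
for (t in eventTimes) {
  atRisk <- sum(time >= t)                # still under observation just before t
  d      <- sum(time == t & status == 1)  # events exactly at t
  surv   <- surv * (1 - d / atRisk)
  cat("S(", t, ") =", round(surv, 3), "\n")
}
# S(2) = 0.833, S(3) = 0.667, S(5) = 0.444, S(8) = 0
# (drops to 0 because the largest observed time is an event)
```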

Cox PH model with constant covariates:

  • Model definition: lambda(t | x) = lambda0(t) * exp(b1*x1 + ... + bk*xk), where lambda0(t) is the baseline hazard
    • Predictors are linearly related to the log hazard, i.e., multiplicatively related to the hazard function, lambda
    • The relative hazard (hazard ratio) between any two observations needs to remain constant over time - the proportional hazards assumption
  • Fitting a survival model in R
    • library(rms)
    • units(dataSurv$tenure) <- "Month"
    • dd <- datadist(dataSurv)
    • options(datadist = "dd")
    • fitCPH1 <- cph(Surv(tenure, churn) ~ gender + SeniorCitizen + Partner + Dependents + StreamMov + PaperlessBilling + PayMeth + MonthlyCharges, data = dataSurv, x = TRUE, y = TRUE, surv = TRUE, time.inc = 1)
    • Coefficient interpretation is relatively similar to logistic regression - exp(fitCPH1$coefficients) - can simplify the coefficients by making them multiplicative (1.00 is no impact)
    • survplot(fitCPH1, MonthlyCharges, label.curves = list(keys = 1:5)) # plots the survival probabilities based on varying 1 variable, assuming other variables constant
    • survplot(fitCPH1, Partner) # covariate with partner, plotted
    • plot(summary(fitCPH1), log = TRUE) # visualizing the hazard ratios
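
Since exp(coef) multiplies the hazard, a sketch with hypothetical coefficient values (these are not course output) makes the interpretation concrete:

```r
# Hypothetical Cox coefficients, chosen only for illustration
coefs <- c("Partner=Yes" = -0.30, MonthlyCharges = 0.01)
hazardRatios <- exp(coefs)
round(hazardRatios, 3)
# Partner = Yes multiplies the hazard by exp(-0.30) ~ 0.741 (lower churn risk);
# each extra unit of MonthlyCharges multiplies it by exp(0.01) ~ 1.010
```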

Checking model assumptions and making predictions:

  • Can again use the Cox PH function
    • testCPH1 <- cox.zph(fitCPH1)
    • print(testCPH1) # if p < 0.05, reject the null hypothesis that the predictor meets the proportional hazards assumption
    • plot(testCPH1, var = "Partner=Yes")
    • plot(testCPH1, var = "MonthlyCharges")
    • This test is conservative and sensitive to the number of observations
  • If the PH (proportional hazard) assumptions are violated, can correct for this using
    • fitCPH2 <- cph(Surv(tenure, churn) ~ MonthlyCharges + SeniorCitizen + Partner + Dependents + StreamMov + Contract + strat(gender), data = dataSurv, x = TRUE, y = TRUE, surv = TRUE) # stratify on the covariate that violates the PH assumption via strat()
    • rms::validate(fitCPH1, method = "crossvalidation", B = 10, pr = FALSE) # pr=FALSE means only print at the end; R2 is the R-squared corrected by cross-validation
  • Can then assess probabilities for the event to occur
    • oneNewData <- data.frame(gender = "Female", SeniorCitizen = "Yes", Partner = "No", Dependents = "Yes", StreamMov = "Yes", PaperlessBilling = "Yes", PayMeth = "BankTrans(auto)", MonthlyCharges = 37.12)
    • str(survest(fitCPH1, newdata = oneNewData, times = 3))
    • plot(survfit(fitCPH1, newdata = oneNewData))
    • print(survfit(fitCPH1, newdata = oneNewData))

Example code includes:

survData <- readr::read_delim("./RInputFiles/survivalDataExercise.csv", delim=",")
## Parsed with column specification:
## cols(
##   daysSinceFirstPurch = col_integer(),
##   shoppingCartValue = col_double(),
##   gender = col_character(),
##   voucher = col_integer(),
##   returned = col_integer(),
##   boughtAgain = col_integer()
## )
dataNextOrder <- survData %>%
    select(daysSinceFirstPurch, boughtAgain)

# Look at the head of the data
head(dataNextOrder)
## # A tibble: 6 x 2
##   daysSinceFirstPurch boughtAgain
##                 <int>       <int>
## 1                  37           0
## 2                  63           1
## 3                  48           0
## 4                  17           1
## 5                  53           0
## 6                  11           1
# Plot a histogram
ggplot(dataNextOrder) +
  geom_histogram(aes(x = daysSinceFirstPurch, fill = factor(boughtAgain))) +
  facet_grid( ~ boughtAgain) + # Separate plots for boughtAgain = 1 vs. 0
  theme(legend.position = "none") # Don't show legend
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Create survival object
survObj <- survival::Surv(dataNextOrder$daysSinceFirstPurch, dataNextOrder$boughtAgain)

# Look at structure
str(survObj)
##  Surv [1:5122, 1:2]  37+  63   48+  17   53+  11   22   16   74+  44  ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:2] "time" "status"
##  - attr(*, "type")= chr "right"
# Compute and print fit
fitKMSimple <- survival::survfit(survObj ~ 1)
print(fitKMSimple)
## Call: survfit(formula = survObj ~ 1)
## 
##       n  events  median 0.95LCL 0.95UCL 
##    5122    3199      41      40      42
# Plot fit
plot(fitKMSimple, conf.int = FALSE, xlab = "Time since first purchase", 
     ylab = "Survival function", main = "Survival function"
     )

dataNextOrder <- survData %>%
    select(daysSinceFirstPurch, boughtAgain, voucher)

# Compute fit with categorical covariate
fitKMCov <- survival::survfit(survObj ~ voucher, data = dataNextOrder)

# Plot fit with covariate and add labels
plot(fitKMCov, lty = 2:3, xlab = "Time since first purchase", 
     ylab = "Survival function", main = "Survival function"
     )
legend(90, .9, c("No", "Yes"), lty = 2:3)

dataNextOrder <- survData

# Determine distributions of predictor variables
dd <- rms::datadist(dataNextOrder)
options(datadist = "dd")

# Compute Cox PH Model and print results
fitCPH <- rms::cph(survival::Surv(daysSinceFirstPurch, boughtAgain) ~ 
                       shoppingCartValue + voucher + returned + gender, data = dataNextOrder, 
                   x = TRUE, y = TRUE, surv = TRUE
                   )
print(fitCPH)
## Cox Proportional Hazards Model
##  
##  rms::cph(formula = survival::Surv(daysSinceFirstPurch, boughtAgain) ~ 
##      shoppingCartValue + voucher + returned + gender, data = dataNextOrder, 
##      x = TRUE, y = TRUE, surv = TRUE)
##  
##                       Model Tests       Discrimination    
##                                            Indexes        
##  Obs       5122    LR chi2    155.68    R2       0.030    
##  Events    3199    d.f.            4    Dxy      0.116    
##  Center -0.2808    Pr(> chi2) 0.0000    g        0.238    
##                    Score chi2 140.57    gr       1.269    
##                    Pr(> chi2) 0.0000                      
##  
##                    Coef    S.E.   Wald Z Pr(>|Z|)
##  shoppingCartValue -0.0021 0.0003 -7.56  <0.0001 
##  voucher           -0.2945 0.0480 -6.14  <0.0001 
##  returned          -0.3145 0.0495 -6.36  <0.0001 
##  gender=male        0.1080 0.0363  2.97  0.0029  
## 
# Interpret coefficients
exp(fitCPH$coefficients)
## shoppingCartValue           voucher          returned       gender=male 
##         0.9978601         0.7449362         0.7301667         1.1140891
# Plot result summary
plot(summary(fitCPH), log = TRUE)

# Check proportional hazard assumption and print result
testCPH <- survival::cox.zph(fitCPH)
print(testCPH)
##                       rho chisq      p
## shoppingCartValue -0.0168 0.907 0.3409
## voucher           -0.0155 0.770 0.3803
## returned           0.0261 2.182 0.1397
## gender=male        0.0390 4.922 0.0265
## GLOBAL                 NA 8.528 0.0740
# Plot time-dependent beta
plot(testCPH, var = "gender=male")

# Validate model
rms::validate(fitCPH, method = "crossvalidation", B = 10, dxy = TRUE, pr = FALSE)
##       index.orig training   test optimism index.corrected  n
## Dxy       0.1159   0.1160 0.1145   0.0014          0.1144 10
## R2        0.0299   0.0300 0.0288   0.0013          0.0287 10
## Slope     1.0000   1.0000 0.9733   0.0267          0.9733 10
## D         0.0032   0.0033 0.0042  -0.0009          0.0041 10
## U         0.0000   0.0000 0.0002  -0.0002          0.0002 10
## Q         0.0032   0.0033 0.0040  -0.0007          0.0040 10
## g         0.2380   0.2382 0.2320   0.0062          0.2318 10
# Create data with new customer
newCustomer <- data.frame(daysSinceFirstPurch = 21, shoppingCartValue = 99.9, gender = "female", 
                          voucher = 1, returned = 0, stringsAsFactors = FALSE
                          )

# Make predictions
pred <- survival::survfit(fitCPH, newdata = newCustomer)
print(pred)
## Call: survfit(formula = fitCPH, newdata = newCustomer)
## 
##       n  events  median 0.95LCL 0.95UCL 
##    5122    3199      47      44      49
plot(pred)

# Correct the customer's gender
newCustomer2 <- newCustomer
newCustomer2$gender <- "male"

# Redo prediction
pred2 <- survival::survfit(fitCPH, newdata = newCustomer2)
print(pred2)
## Call: survfit(formula = fitCPH, newdata = newCustomer2)
## 
##       n  events  median 0.95LCL 0.95UCL 
##    5122    3199      44      42      47

Chapter 4 - Reducing Dimensionality with Principal Component Analysis

PCA for CRM Data - addresses multicollinearity and data volume issues in the raw CRM data:

  • PCA reduces a large number of correlated variables to a smaller number of uncorrelated (orthogonal) variables
  • PCA can also help with creating an index, such as using the first component of the PCA
  • All variables must be either continuous or binary prior to running the PCA analysis
    • dataCustomers %>% cor() %>% corrplot() # plot the initial correlations

PCA Computation:

  • Need to standardize the variables first, otherwise high-variance variables will be over-represented in the PCA
    • lapply(dataCustomers, var)
    • dataCustomers <- dataCustomers %>% scale() %>% as.data.frame()
    • pcaCust <- prcomp(dataCustomers)
    • pcaCust$sdev %>% round(2) # standard deviations by component
    • pcaCust$sdev ^ 2 %>% round(2) # variances, also known as eigenvalues, by component give a good sense for relative importance (relative ratio is percent of variance explained)
    • round(pcaCust$rotation[, 1:6], 2) # correlations between original variables and principal components (can use these to give descriptive names to components)
  • Each observation's score on a component is the sum of its (scaled) variable values weighted by that component's loadings
    • sum(dataCustomers[1,] * pcaCust$rotation[,1]) # Value on 1st component for 1st customer
    • pcaCust$x[1:5, 1:6] # first 5 customers' scores on the first 6 components
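
The scaling and by-hand score computation above can be checked end-to-end on a built-in dataset (mtcars stands in for dataCustomers, which is not available here):

```r
datM <- scale(mtcars)  # standardize: mean 0, sd 1 per column
pca  <- prcomp(datM)

eig <- pca$sdev^2          # eigenvalues (variances per component)
sum(eig)                   # equals ncol(datM) = 11 after scaling
round(eig / sum(eig), 3)   # proportion of variance explained per component

# Score of the first observation on PC1, computed by hand, matches prcomp:
sum(datM[1, ] * pca$rotation[, 1])
pca$x[1, 1]
```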

PCA Model Specification:

  • Need to decide on how many components to keep - balance size of data vs. reconstruction of original data
    • Can set a minimum requirement for percentage of variance explained (such as 70%)
    • summary(pcaCust) # will show cumulatives also
    • Can use the Kaiser-Guttman criterion, which keeps only components with an eigenvalue greater than 1 (since 1 is the average eigenvalue for standardized data)
    • Can also draw a scree plot to see the variances (eigenvalues) in descending order - look for an elbow
    • screeplot(pcaCust, type = "lines")
    • Generally, use a few different techniques, and pick a number that is “in the range”
  • The biplot can help to show how the data map on to the principal components
    • biplot(pcaCust, choices = 1:2, cex = 0.7) # will show PC1 and PC2, with arrows for the various features and how they map on them
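
The component-count heuristics above can be sketched together (again using mtcars as a stand-in dataset):

```r
pca <- prcomp(scale(mtcars))
eig <- pca$sdev^2

sum(eig > 1)                     # Kaiser-Guttman: components with eigenvalue > 1
cumvar <- cumsum(eig) / sum(eig)
which(cumvar >= 0.70)[1]         # smallest k explaining at least 70% of variance
screeplot(pca, type = "lines")   # scree plot: look for the elbow
```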

Principal components in a regression analysis:

  • PCA can help to solve the multi-collinearity problem in a regression
    • dataCustComponents <- cbind(dataCustomers[, "customerSatis"], pcaCust$x[, 1:6]) %>% as.data.frame
    • mod2 <- lm(customerSatis ~ ., dataCustComponents)
    • vif(mod2) # by construction, these will all be 1, since the principal components are orthogonal
  • Factor analysis is another dimension-reduction technique, sometimes confused with PCA
    • Factor analysis theorizes that latent constructs (e.g., intelligence) which cannot be directly measured are influencing the observed variables
    • Factor analysis is often used in questionnaires - factor analysis can investigate where multiple questions really just measure one thing
    • In contrast, with PCA, the features are actually being combined to model the data
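
Because the component scores are orthogonal, the VIFs in such a regression are exactly 1; this can be verified without the car package (mtcars again as a stand-in):

```r
pca    <- prcomp(scale(mtcars))
scores <- pca$x[, 1:4]

round(cor(scores), 10)  # identity matrix: the components are uncorrelated

# VIF_j = 1 / (1 - R^2) from regressing component j on the other components
r2   <- summary(lm(scores[, 1] ~ scores[, -1]))$r.squared
vif1 <- 1 / (1 - r2)
vif1  # 1 up to numerical error
```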

Wrap up:

  • Logistic regression for churn
  • Survival analysis to prevent churn
  • Principal component analysis (PCA) to reduce multicollinearity

Example code includes:

load("./RInputFiles/newsData.RData")

rawData <- newsData
newsData <- newsData[, c('n_tokens_title', 'n_tokens_content', 'n_unique_tokens', 'num_hrefs', 'num_self_hrefs', 'num_imgs', 'num_videos', 'num_keywords', 'is_weekend', 'kw_avg_min', 'kw_avg_avg', 'kw_avg_max', 'average_token_length', 'global_subjectivity', 'global_sentiment_polarity', 'global_rate_positive_words', 'global_rate_negative_words', 'avg_positive_polarity', 'avg_negative_polarity', 'title_subjectivity', 'title_sentiment_polarity')]


# Overview of data structure:
str(newsData, give.attr = FALSE)
## Classes 'tbl_df', 'tbl' and 'data.frame':    39644 obs. of  21 variables:
##  $ n_tokens_title            : num  12 9 9 9 13 10 8 12 11 10 ...
##  $ n_tokens_content          : num  219 255 211 531 1072 ...
##  $ n_unique_tokens           : num  0.664 0.605 0.575 0.504 0.416 ...
##  $ num_hrefs                 : num  4 3 3 9 19 2 21 20 2 4 ...
##  $ num_self_hrefs            : num  2 1 1 0 19 2 20 20 0 1 ...
##  $ num_imgs                  : num  1 1 1 1 20 0 20 20 0 1 ...
##  $ num_videos                : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ num_keywords              : num  5 4 6 7 7 9 10 9 7 5 ...
##  $ is_weekend                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_avg_min                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_avg_avg                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_avg_max                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ average_token_length      : num  4.68 4.91 4.39 4.4 4.68 ...
##  $ global_subjectivity       : num  0.522 0.341 0.702 0.43 0.514 ...
##  $ global_sentiment_polarity : num  0.0926 0.1489 0.3233 0.1007 0.281 ...
##  $ global_rate_positive_words: num  0.0457 0.0431 0.0569 0.0414 0.0746 ...
##  $ global_rate_negative_words: num  0.0137 0.01569 0.00948 0.02072 0.01213 ...
##  $ avg_positive_polarity     : num  0.379 0.287 0.496 0.386 0.411 ...
##  $ avg_negative_polarity     : num  -0.35 -0.119 -0.467 -0.37 -0.22 ...
##  $ title_subjectivity        : num  0.5 0 0 0 0.455 ...
##  $ title_sentiment_polarity  : num  -0.188 0 0 0 0.136 ...
# Correlation structure:
newsData %>% cor() %>% corrplot::corrplot()

# Standardize data
newsData <- newsData %>% scale() %>% as.data.frame()

# Compute PCA
pcaNews <- newsData %>% prcomp()

# Eigenvalues
pcaNews$sdev**2
##  [1] 3.31015107 2.00241491 1.82662819 1.67421238 1.30249854 1.20443028
##  [7] 1.02889482 1.00052438 0.97929267 0.95905061 0.82676492 0.74951891
## [13] 0.73162009 0.66351863 0.62319656 0.57949073 0.47020594 0.41516936
## [19] 0.29926492 0.27690363 0.07624847
# Screeplot:
screeplot(pcaNews, type = "lines")

# Cumulative explained variance:
summary(pcaNews)
## Importance of components:
##                           PC1     PC2     PC3     PC4     PC5     PC6
## Standard deviation     1.8194 1.41507 1.35153 1.29391 1.14127 1.09747
## Proportion of Variance 0.1576 0.09535 0.08698 0.07972 0.06202 0.05735
## Cumulative Proportion  0.1576 0.25298 0.33996 0.41969 0.48171 0.53906
##                            PC7     PC8     PC9    PC10    PC11    PC12
## Standard deviation     1.01434 1.00026 0.98959 0.97931 0.90927 0.86575
## Proportion of Variance 0.04899 0.04764 0.04663 0.04567 0.03937 0.03569
## Cumulative Proportion  0.58806 0.63570 0.68234 0.72800 0.76737 0.80307
##                           PC13   PC14    PC15    PC16    PC17    PC18
## Standard deviation     0.85535 0.8146 0.78943 0.76124 0.68572 0.64434
## Proportion of Variance 0.03484 0.0316 0.02968 0.02759 0.02239 0.01977
## Cumulative Proportion  0.83790 0.8695 0.89918 0.92677 0.94916 0.96893
##                           PC19    PC20    PC21
## Standard deviation     0.54705 0.52622 0.27613
## Proportion of Variance 0.01425 0.01319 0.00363
## Cumulative Proportion  0.98318 0.99637 1.00000
# Kaiser-Guttmann (number of components with eigenvalue larger than 1):
sum(pcaNews$sdev > 1)
## [1] 8
# Print loadings of the first six components
pcaNews$rotation[, 1:6] %>% round(2)
##                              PC1   PC2   PC3   PC4   PC5   PC6
## n_tokens_title             -0.05 -0.10  0.01 -0.10  0.20 -0.28
## n_tokens_content            0.23 -0.17 -0.38  0.12  0.15 -0.02
## n_unique_tokens             0.00  0.00  0.00  0.01  0.01  0.06
## num_hrefs                   0.26 -0.16 -0.42 -0.03  0.07  0.11
## num_self_hrefs              0.20 -0.07 -0.39  0.06  0.12  0.08
## num_imgs                    0.14 -0.15 -0.43 -0.06  0.04  0.08
## num_videos                  0.09 -0.20  0.04 -0.19  0.16 -0.14
## num_keywords                0.07  0.11 -0.25  0.14 -0.42 -0.30
## is_weekend                  0.05 -0.01 -0.12 -0.02 -0.10 -0.16
## kw_avg_min                  0.03  0.01 -0.05 -0.25 -0.65  0.07
## kw_avg_avg                  0.02 -0.15 -0.06 -0.61 -0.31  0.17
## kw_avg_max                 -0.10 -0.21  0.10 -0.50  0.35  0.26
## average_token_length        0.39 -0.02  0.19  0.19 -0.01  0.14
## global_subjectivity         0.45 -0.01  0.23 -0.04 -0.03  0.03
## global_sentiment_polarity   0.25  0.55 -0.03 -0.19  0.11  0.13
## global_rate_positive_words  0.33  0.25  0.14 -0.08  0.04 -0.09
## global_rate_negative_words  0.15 -0.47  0.23  0.11 -0.10 -0.21
## avg_positive_polarity       0.42  0.09  0.17 -0.06  0.02  0.10
## avg_negative_polarity      -0.25  0.37 -0.20 -0.04  0.08  0.06
## title_subjectivity          0.07 -0.03  0.01 -0.27  0.07 -0.61
## title_sentiment_polarity    0.07  0.24 -0.11 -0.24  0.15 -0.42
pcaNews %>% biplot(choices=1:2, cex = 0.5)

# Predict log shares with all original variables
logShares <- rawData %>%
    select(shares) %>%
    mutate(logShares=log(1+shares)) %>%
    pull(logShares) %>%
    scale()

newsData <- newsData %>%
    cbind(logShares)

mod1 <- lm(logShares ~ ., data = newsData)

# Create dataframe with log shares and first 6 components
dataNewsComponents <- cbind(logShares = newsData[, "logShares"], pcaNews$x[, 1:6]) %>%
  as.data.frame()

# Predict log shares with first six components
mod2 <- lm(logShares ~ ., data = dataNewsComponents)

# Print adjusted R squared for both models
summary(mod1)$adj.r.squared
## [1] 0.07954578
summary(mod2)$adj.r.squared
## [1] 0.05066316

Interactive Maps with leaflet in R

Chapter 1 - Setting Up Interactive Web Maps

Introduction to leaflet - open-source JavaScript library that makes interactive, mobile-friendly maps:

  • Objective for this course is to build up to an interactive map of 4-year colleges, including incorporation type (public, private, etc.)
    • Additionally, labels that occur when hovering
  • Leaflet builds maps using tiles, which join many smaller maps together
    • library(leaflet)
    • leaflet() %>% addTiles() # zooming and scrolling lead to new tiles being shown
  • In Chapter 1, will use multiple tile types to create maps of the DataCamp HQ in Belgium and Boston
    • leaflet() %>% addProviderTiles("CartoDB") %>% addMarkers(lng = dc_hq$lon, lat = dc_hq$lat, popup = dc_hq$hq)

Map tiles - over 100 pre-canned maps that are available as bases:

  • Selecting a base map - consider the intended purpose of the map, and ensure that the maps selected meet that purpose
    • Instructor has a preference for gray-scale maps (for ease of seeing other data)
  • The base maps are stored as “providers” - most are available for immediate use, but a few require registration
    • names(providers) # get all the available providers
    • names(providers)[str_detect(names(providers), "OpenStreetMap")] # all from OpenStreetMap
    • leaflet() %>% addProviderTiles("OpenStreetMap.BlackAndWhite") # replaces the default addTiles() base map with the BW OpenStreetMap

Setting the default map view:

  • Can load the map centered on a specific point and with a requested zoom level - common to use ggmap::geocode()
    • ggmap::geocode("350 5th Ave, New York, NY 10118") # will return the lat-lon where possible (uses the Google API unless source = "dsk" is chosen)
  • Can use either setView() or fitBounds()
    • leaflet() %>% addTiles() %>% setView(lng = -73.98575, lat = 40.74856, zoom = 13) # setView picks a lat/lon and zoom
    • leaflet() %>% addTiles() %>% fitBounds( lng1 = -73.910, lat1 = 40.773, lng2 = -74.060, lat2 = 40.723) # fitBounds defines a rectangle
  • Can limit user controls such as panning and zooming
    • leaflet(options = leafletOptions(dragging = FALSE, minZoom = 14, maxZoom = 18)) %>% addProviderTiles("CartoDB") %>% setView(lng = -73.98575, lat = 40.74856, zoom = 18)
    • dragging=FALSE removes the ability to pan
    • maxZoom and minZoom limit the options for zooming
    • leaflet() %>% addTiles() %>% setView(lng = -73.98575, lat = 40.74856, zoom = 18) %>% setMaxBounds(lng1 = -73.98575, lat1 = 40.74856, lng2 = -73.98575, lat2 = 40.74856)
    • setMaxBounds() limits the user to the boundaries that you pre-specify
  • For more information, see the leaflet package documentation

Plotting DataCamp HQ:

  • Location markers are a common addition, managed using addMarkers()
    • leaflet() %>% addTiles() %>% addMarkers(lng = -73.98575, lat = 40.74856)
    • If single vectors are passed to lng and lat, then a single blue pin will be placed and the map will be centered/zoomed there
    • dc_hq <- tibble(hq = c("DataCamp - NYC", "DataCamp - Belgium"), lon = c(-73.98575, 4.717863), lat = c(40.74856, 50.881363))
    • leaflet() %>% addTiles() %>% addMarkers(lng = dc_hq$lon, lat = dc_hq$lat)
    • When the tibble is piped in, the map will be zoomed/centered such that all the pins can be displayed: dc_hq %>% leaflet() %>% addTiles() %>% addMarkers()
    • The functions will seek a lat and lon column from the piped in data (dc_hq in this case), and pass along a note that they were used
  • Pop-ups are a common way to provide additional information about a marker
    • leaflet() %>% addTiles() %>% addMarkers(lng = dc_hq$lon, lat = dc_hq$lat, popup = dc_hq$hq) # markers, with popup enabled on clicking
    • leaflet() %>% addTiles() %>% addPopups(lng = dc_hq$lon, lat = dc_hq$lat, popup = dc_hq$hq) # popups instead of markers
  • Leaflets can be stored as objects (similar to ggplot2), with additions and prints and whatnot called later
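
Putting the pieces above together (provider tiles, markers from a table of locations, click popups) - a sketch that only builds the htmlwidget object, using data.frame in place of tibble so only leaflet is required:

```r
library(leaflet)

# Office coordinates from the course notes
dc_hq <- data.frame(hq  = c("DataCamp - NYC", "DataCamp - Belgium"),
                    lon = c(-73.98575, 4.717863),
                    lat = c(40.74856, 50.881363))

m <- dc_hq %>%
  leaflet() %>%                  # lon/lat columns are picked up automatically
  addProviderTiles("CartoDB") %>%
  addMarkers(popup = ~hq)        # popup text shown when a marker is clicked
m  # printing the stored object renders the interactive map
```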

Example code includes:

# Load the leaflet library
library(leaflet)
## 
## Attaching package: 'leaflet'
## The following object is masked from 'package:xts':
## 
##     addLegend
# Create a leaflet map with default map tile using addTiles()
leaflet() %>%
    addTiles()
# Print the providers list included in the leaflet library
providers
## $OpenStreetMap
## [1] "OpenStreetMap"
## 
## $OpenStreetMap.Mapnik
## [1] "OpenStreetMap.Mapnik"
## 
## $OpenStreetMap.BlackAndWhite
## [1] "OpenStreetMap.BlackAndWhite"
## 
## $OpenStreetMap.DE
## [1] "OpenStreetMap.DE"
## 
## $OpenStreetMap.CH
## [1] "OpenStreetMap.CH"
## 
## $OpenStreetMap.France
## [1] "OpenStreetMap.France"
## 
## $OpenStreetMap.HOT
## [1] "OpenStreetMap.HOT"
## 
## $OpenStreetMap.BZH
## [1] "OpenStreetMap.BZH"
## 
## $OpenInfraMap
## [1] "OpenInfraMap"
## 
## $OpenInfraMap.Power
## [1] "OpenInfraMap.Power"
## 
## $OpenInfraMap.Telecom
## [1] "OpenInfraMap.Telecom"
## 
## $OpenInfraMap.Petroleum
## [1] "OpenInfraMap.Petroleum"
## 
## $OpenInfraMap.Water
## [1] "OpenInfraMap.Water"
## 
## $OpenSeaMap
## [1] "OpenSeaMap"
## 
## $OpenPtMap
## [1] "OpenPtMap"
## 
## $OpenTopoMap
## [1] "OpenTopoMap"
## 
## $OpenRailwayMap
## [1] "OpenRailwayMap"
## 
## $OpenFireMap
## [1] "OpenFireMap"
## 
## $SafeCast
## [1] "SafeCast"
## 
## $Thunderforest
## [1] "Thunderforest"
## 
## $Thunderforest.OpenCycleMap
## [1] "Thunderforest.OpenCycleMap"
## 
## $Thunderforest.Transport
## [1] "Thunderforest.Transport"
## 
## $Thunderforest.TransportDark
## [1] "Thunderforest.TransportDark"
## 
## $Thunderforest.SpinalMap
## [1] "Thunderforest.SpinalMap"
## 
## $Thunderforest.Landscape
## [1] "Thunderforest.Landscape"
## 
## $Thunderforest.Outdoors
## [1] "Thunderforest.Outdoors"
## 
## $Thunderforest.Pioneer
## [1] "Thunderforest.Pioneer"
## 
## $OpenMapSurfer
## [1] "OpenMapSurfer"
## 
## $OpenMapSurfer.Roads
## [1] "OpenMapSurfer.Roads"
## 
## $OpenMapSurfer.AdminBounds
## [1] "OpenMapSurfer.AdminBounds"
## 
## $OpenMapSurfer.Grayscale
## [1] "OpenMapSurfer.Grayscale"
## 
## $Hydda
## [1] "Hydda"
## 
## $Hydda.Full
## [1] "Hydda.Full"
## 
## $Hydda.Base
## [1] "Hydda.Base"
## 
## $Hydda.RoadsAndLabels
## [1] "Hydda.RoadsAndLabels"
## 
## $MapBox
## [1] "MapBox"
## 
## $Stamen
## [1] "Stamen"
## 
## $Stamen.Toner
## [1] "Stamen.Toner"
## 
## $Stamen.TonerBackground
## [1] "Stamen.TonerBackground"
## 
## $Stamen.TonerHybrid
## [1] "Stamen.TonerHybrid"
## 
## $Stamen.TonerLines
## [1] "Stamen.TonerLines"
## 
## $Stamen.TonerLabels
## [1] "Stamen.TonerLabels"
## 
## $Stamen.TonerLite
## [1] "Stamen.TonerLite"
## 
## $Stamen.Watercolor
## [1] "Stamen.Watercolor"
## 
## $Stamen.Terrain
## [1] "Stamen.Terrain"
## 
## $Stamen.TerrainBackground
## [1] "Stamen.TerrainBackground"
## 
## $Stamen.TopOSMRelief
## [1] "Stamen.TopOSMRelief"
## 
## $Stamen.TopOSMFeatures
## [1] "Stamen.TopOSMFeatures"
## 
## $Esri
## [1] "Esri"
## 
## $Esri.WorldStreetMap
## [1] "Esri.WorldStreetMap"
## 
## $Esri.DeLorme
## [1] "Esri.DeLorme"
## 
## $Esri.WorldTopoMap
## [1] "Esri.WorldTopoMap"
## 
## $Esri.WorldImagery
## [1] "Esri.WorldImagery"
## 
## $Esri.WorldTerrain
## [1] "Esri.WorldTerrain"
## 
## $Esri.WorldShadedRelief
## [1] "Esri.WorldShadedRelief"
## 
## $Esri.WorldPhysical
## [1] "Esri.WorldPhysical"
## 
## $Esri.OceanBasemap
## [1] "Esri.OceanBasemap"
## 
## $Esri.NatGeoWorldMap
## [1] "Esri.NatGeoWorldMap"
## 
## $Esri.WorldGrayCanvas
## [1] "Esri.WorldGrayCanvas"
## 
## $OpenWeatherMap
## [1] "OpenWeatherMap"
## 
## $OpenWeatherMap.Clouds
## [1] "OpenWeatherMap.Clouds"
## 
## $OpenWeatherMap.CloudsClassic
## [1] "OpenWeatherMap.CloudsClassic"
## 
## $OpenWeatherMap.Precipitation
## [1] "OpenWeatherMap.Precipitation"
## 
## $OpenWeatherMap.PrecipitationClassic
## [1] "OpenWeatherMap.PrecipitationClassic"
## 
## $OpenWeatherMap.Rain
## [1] "OpenWeatherMap.Rain"
## 
## $OpenWeatherMap.RainClassic
## [1] "OpenWeatherMap.RainClassic"
## 
## $OpenWeatherMap.Pressure
## [1] "OpenWeatherMap.Pressure"
## 
## $OpenWeatherMap.PressureContour
## [1] "OpenWeatherMap.PressureContour"
## 
## $OpenWeatherMap.Wind
## [1] "OpenWeatherMap.Wind"
## 
## $OpenWeatherMap.Temperature
## [1] "OpenWeatherMap.Temperature"
## 
## $OpenWeatherMap.Snow
## [1] "OpenWeatherMap.Snow"
## 
## $HERE
## [1] "HERE"
## 
## $HERE.normalDay
## [1] "HERE.normalDay"
## 
## $HERE.normalDayCustom
## [1] "HERE.normalDayCustom"
## 
## $HERE.normalDayGrey
## [1] "HERE.normalDayGrey"
## 
## $HERE.normalDayMobile
## [1] "HERE.normalDayMobile"
## 
## $HERE.normalDayGreyMobile
## [1] "HERE.normalDayGreyMobile"
## 
## $HERE.normalDayTransit
## [1] "HERE.normalDayTransit"
## 
## $HERE.normalDayTransitMobile
## [1] "HERE.normalDayTransitMobile"
## 
## $HERE.normalNight
## [1] "HERE.normalNight"
## 
## $HERE.normalNightMobile
## [1] "HERE.normalNightMobile"
## 
## $HERE.normalNightGrey
## [1] "HERE.normalNightGrey"
## 
## $HERE.normalNightGreyMobile
## [1] "HERE.normalNightGreyMobile"
## 
## $HERE.basicMap
## [1] "HERE.basicMap"
## 
## $HERE.mapLabels
## [1] "HERE.mapLabels"
## 
## $HERE.trafficFlow
## [1] "HERE.trafficFlow"
## 
## $HERE.carnavDayGrey
## [1] "HERE.carnavDayGrey"
## 
## $HERE.hybridDay
## [1] "HERE.hybridDay"
## 
## $HERE.hybridDayMobile
## [1] "HERE.hybridDayMobile"
## 
## $HERE.pedestrianDay
## [1] "HERE.pedestrianDay"
## 
## $HERE.pedestrianNight
## [1] "HERE.pedestrianNight"
## 
## $HERE.satelliteDay
## [1] "HERE.satelliteDay"
## 
## $HERE.terrainDay
## [1] "HERE.terrainDay"
## 
## $HERE.terrainDayMobile
## [1] "HERE.terrainDayMobile"
## 
## $FreeMapSK
## [1] "FreeMapSK"
## 
## $MtbMap
## [1] "MtbMap"
## 
## $CartoDB
## [1] "CartoDB"
## 
## $CartoDB.Positron
## [1] "CartoDB.Positron"
## 
## $CartoDB.PositronNoLabels
## [1] "CartoDB.PositronNoLabels"
## 
## $CartoDB.PositronOnlyLabels
## [1] "CartoDB.PositronOnlyLabels"
## 
## $CartoDB.DarkMatter
## [1] "CartoDB.DarkMatter"
## 
## $CartoDB.DarkMatterNoLabels
## [1] "CartoDB.DarkMatterNoLabels"
## 
## $CartoDB.DarkMatterOnlyLabels
## [1] "CartoDB.DarkMatterOnlyLabels"
## 
## $HikeBike
## [1] "HikeBike"
## 
## $HikeBike.HikeBike
## [1] "HikeBike.HikeBike"
## 
## $HikeBike.HillShading
## [1] "HikeBike.HillShading"
## 
## $BasemapAT
## [1] "BasemapAT"
## 
## $BasemapAT.basemap
## [1] "BasemapAT.basemap"
## 
## $BasemapAT.grau
## [1] "BasemapAT.grau"
## 
## $BasemapAT.overlay
## [1] "BasemapAT.overlay"
## 
## $BasemapAT.highdpi
## [1] "BasemapAT.highdpi"
## 
## $BasemapAT.orthofoto
## [1] "BasemapAT.orthofoto"
## 
## $nlmaps
## [1] "nlmaps"
## 
## $nlmaps.standaard
## [1] "nlmaps.standaard"
## 
## $nlmaps.pastel
## [1] "nlmaps.pastel"
## 
## $nlmaps.grijs
## [1] "nlmaps.grijs"
## 
## $nlmaps.luchtfoto
## [1] "nlmaps.luchtfoto"
## 
## $NASAGIBS
## [1] "NASAGIBS"
## 
## $NASAGIBS.ModisTerraTrueColorCR
## [1] "NASAGIBS.ModisTerraTrueColorCR"
## 
## $NASAGIBS.ModisTerraBands367CR
## [1] "NASAGIBS.ModisTerraBands367CR"
## 
## $NASAGIBS.ViirsEarthAtNight2012
## [1] "NASAGIBS.ViirsEarthAtNight2012"
## 
## $NASAGIBS.ModisTerraLSTDay
## [1] "NASAGIBS.ModisTerraLSTDay"
## 
## $NASAGIBS.ModisTerraSnowCover
## [1] "NASAGIBS.ModisTerraSnowCover"
## 
## $NASAGIBS.ModisTerraAOD
## [1] "NASAGIBS.ModisTerraAOD"
## 
## $NASAGIBS.ModisTerraChlorophyll
## [1] "NASAGIBS.ModisTerraChlorophyll"
## 
## $NLS
## [1] "NLS"
## 
## $JusticeMap
## [1] "JusticeMap"
## 
## $JusticeMap.income
## [1] "JusticeMap.income"
## 
## $JusticeMap.americanIndian
## [1] "JusticeMap.americanIndian"
## 
## $JusticeMap.asian
## [1] "JusticeMap.asian"
## 
## $JusticeMap.black
## [1] "JusticeMap.black"
## 
## $JusticeMap.hispanic
## [1] "JusticeMap.hispanic"
## 
## $JusticeMap.multi
## [1] "JusticeMap.multi"
## 
## $JusticeMap.nonWhite
## [1] "JusticeMap.nonWhite"
## 
## $JusticeMap.white
## [1] "JusticeMap.white"
## 
## $JusticeMap.plurality
## [1] "JusticeMap.plurality"
## 
## $Wikimedia
## [1] "Wikimedia"
# Print only the names of the map tiles in the providers list 
names(providers)
##   [1] "OpenStreetMap"                      
##   [2] "OpenStreetMap.Mapnik"               
##   [3] "OpenStreetMap.BlackAndWhite"        
##   [4] "OpenStreetMap.DE"                   
##   [5] "OpenStreetMap.CH"                   
##   [6] "OpenStreetMap.France"               
##   [7] "OpenStreetMap.HOT"                  
##   [8] "OpenStreetMap.BZH"                  
##   [9] "OpenInfraMap"                       
##  [10] "OpenInfraMap.Power"                 
##  [11] "OpenInfraMap.Telecom"               
##  [12] "OpenInfraMap.Petroleum"             
##  [13] "OpenInfraMap.Water"                 
##  [14] "OpenSeaMap"                         
##  [15] "OpenPtMap"                          
##  [16] "OpenTopoMap"                        
##  [17] "OpenRailwayMap"                     
##  [18] "OpenFireMap"                        
##  [19] "SafeCast"                           
##  [20] "Thunderforest"                      
##  [21] "Thunderforest.OpenCycleMap"         
##  [22] "Thunderforest.Transport"            
##  [23] "Thunderforest.TransportDark"        
##  [24] "Thunderforest.SpinalMap"            
##  [25] "Thunderforest.Landscape"            
##  [26] "Thunderforest.Outdoors"             
##  [27] "Thunderforest.Pioneer"              
##  [28] "OpenMapSurfer"                      
##  [29] "OpenMapSurfer.Roads"                
##  [30] "OpenMapSurfer.AdminBounds"          
##  [31] "OpenMapSurfer.Grayscale"            
##  [32] "Hydda"                              
##  [33] "Hydda.Full"                         
##  [34] "Hydda.Base"                         
##  [35] "Hydda.RoadsAndLabels"               
##  [36] "MapBox"                             
##  [37] "Stamen"                             
##  [38] "Stamen.Toner"                       
##  [39] "Stamen.TonerBackground"             
##  [40] "Stamen.TonerHybrid"                 
##  [41] "Stamen.TonerLines"                  
##  [42] "Stamen.TonerLabels"                 
##  [43] "Stamen.TonerLite"                   
##  [44] "Stamen.Watercolor"                  
##  [45] "Stamen.Terrain"                     
##  [46] "Stamen.TerrainBackground"           
##  [47] "Stamen.TopOSMRelief"                
##  [48] "Stamen.TopOSMFeatures"              
##  [49] "Esri"                               
##  [50] "Esri.WorldStreetMap"                
##  [51] "Esri.DeLorme"                       
##  [52] "Esri.WorldTopoMap"                  
##  [53] "Esri.WorldImagery"                  
##  [54] "Esri.WorldTerrain"                  
##  [55] "Esri.WorldShadedRelief"             
##  [56] "Esri.WorldPhysical"                 
##  [57] "Esri.OceanBasemap"                  
##  [58] "Esri.NatGeoWorldMap"                
##  [59] "Esri.WorldGrayCanvas"               
##  [60] "OpenWeatherMap"                     
##  [61] "OpenWeatherMap.Clouds"              
##  [62] "OpenWeatherMap.CloudsClassic"       
##  [63] "OpenWeatherMap.Precipitation"       
##  [64] "OpenWeatherMap.PrecipitationClassic"
##  [65] "OpenWeatherMap.Rain"                
##  [66] "OpenWeatherMap.RainClassic"         
##  [67] "OpenWeatherMap.Pressure"            
##  [68] "OpenWeatherMap.PressureContour"     
##  [69] "OpenWeatherMap.Wind"                
##  [70] "OpenWeatherMap.Temperature"         
##  [71] "OpenWeatherMap.Snow"                
##  [72] "HERE"                               
##  [73] "HERE.normalDay"                     
##  [74] "HERE.normalDayCustom"               
##  [75] "HERE.normalDayGrey"                 
##  [76] "HERE.normalDayMobile"               
##  [77] "HERE.normalDayGreyMobile"           
##  [78] "HERE.normalDayTransit"              
##  [79] "HERE.normalDayTransitMobile"        
##  [80] "HERE.normalNight"                   
##  [81] "HERE.normalNightMobile"             
##  [82] "HERE.normalNightGrey"               
##  [83] "HERE.normalNightGreyMobile"         
##  [84] "HERE.basicMap"                      
##  [85] "HERE.mapLabels"                     
##  [86] "HERE.trafficFlow"                   
##  [87] "HERE.carnavDayGrey"                 
##  [88] "HERE.hybridDay"                     
##  [89] "HERE.hybridDayMobile"               
##  [90] "HERE.pedestrianDay"                 
##  [91] "HERE.pedestrianNight"               
##  [92] "HERE.satelliteDay"                  
##  [93] "HERE.terrainDay"                    
##  [94] "HERE.terrainDayMobile"              
##  [95] "FreeMapSK"                          
##  [96] "MtbMap"                             
##  [97] "CartoDB"                            
##  [98] "CartoDB.Positron"                   
##  [99] "CartoDB.PositronNoLabels"           
## [100] "CartoDB.PositronOnlyLabels"         
## [101] "CartoDB.DarkMatter"                 
## [102] "CartoDB.DarkMatterNoLabels"         
## [103] "CartoDB.DarkMatterOnlyLabels"       
## [104] "HikeBike"                           
## [105] "HikeBike.HikeBike"                  
## [106] "HikeBike.HillShading"               
## [107] "BasemapAT"                          
## [108] "BasemapAT.basemap"                  
## [109] "BasemapAT.grau"                     
## [110] "BasemapAT.overlay"                  
## [111] "BasemapAT.highdpi"                  
## [112] "BasemapAT.orthofoto"                
## [113] "nlmaps"                             
## [114] "nlmaps.standaard"                   
## [115] "nlmaps.pastel"                      
## [116] "nlmaps.grijs"                       
## [117] "nlmaps.luchtfoto"                   
## [118] "NASAGIBS"                           
## [119] "NASAGIBS.ModisTerraTrueColorCR"     
## [120] "NASAGIBS.ModisTerraBands367CR"      
## [121] "NASAGIBS.ViirsEarthAtNight2012"     
## [122] "NASAGIBS.ModisTerraLSTDay"          
## [123] "NASAGIBS.ModisTerraSnowCover"       
## [124] "NASAGIBS.ModisTerraAOD"             
## [125] "NASAGIBS.ModisTerraChlorophyll"     
## [126] "NLS"                                
## [127] "JusticeMap"                         
## [128] "JusticeMap.income"                  
## [129] "JusticeMap.americanIndian"          
## [130] "JusticeMap.asian"                   
## [131] "JusticeMap.black"                   
## [132] "JusticeMap.hispanic"                
## [133] "JusticeMap.multi"                   
## [134] "JusticeMap.nonWhite"                
## [135] "JusticeMap.white"                   
## [136] "JusticeMap.plurality"               
## [137] "Wikimedia"
# Use str_detect() to determine if the name of each provider tile contains the string "CartoDB"
str_detect(names(providers), "CartoDB")
##   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
## [100]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [111] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [122] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE
# Use str_detect() to print only the provider tile names that include the string "CartoDB"
names(providers)[str_detect(names(providers), "CartoDB")]
## [1] "CartoDB"                      "CartoDB.Positron"            
## [3] "CartoDB.PositronNoLabels"     "CartoDB.PositronOnlyLabels"  
## [5] "CartoDB.DarkMatter"           "CartoDB.DarkMatterNoLabels"  
## [7] "CartoDB.DarkMatterOnlyLabels"
# Change addTiles() to addProviderTiles() and set the provider argument to "CartoDB"
leaflet() %>% 
    addProviderTiles("CartoDB")
# Create a leaflet map that uses the Esri provider tile 
leaflet() %>% 
    addProviderTiles("Esri")
# Create a leaflet map that uses the CartoDB.PositronNoLabels provider tile
leaflet() %>% 
    addProviderTiles("CartoDB.PositronNoLabels")
# Map with CartoDB tile centered on DataCamp's NYC office with zoom of 6
leaflet()  %>% 
    addProviderTiles("CartoDB")  %>% 
    setView(lng = -73.98575, lat = 40.74856, zoom = 6)
dc_hq <- tibble::tibble(hq=c("NYC", "Belgium"), lon=c(-73.98575, 4.71786), lat=c(40.7486, 50.8814))
dc_hq
## # A tibble: 2 x 3
##   hq         lon   lat
##   <chr>    <dbl> <dbl>
## 1 NYC     -74.0   40.7
## 2 Belgium   4.72  50.9
# Map with CartoDB.PositronNoLabels tile centered on DataCamp's Belgium office with zoom of 4
leaflet() %>% 
    addProviderTiles("CartoDB.PositronNoLabels") %>% 
    setView(lng = dc_hq$lon[2], lat = dc_hq$lat[2], zoom = 4)
leaflet(options = leafletOptions(
                    # Set minZoom and dragging 
                    minZoom = 12, dragging = TRUE))  %>% 
  addProviderTiles("CartoDB")  %>% 
  # Set default zoom level 
  setView(lng = dc_hq$lon[2], lat = dc_hq$lat[2], zoom = 14) %>% 
  # Set max bounds of map 
  setMaxBounds(lng1 = dc_hq$lon[2] + 0.05, 
               lat1 = dc_hq$lat[2] + .05, 
               lng2 = dc_hq$lon[2] - 0.05, 
               lat2 = dc_hq$lat[2] - .05) 
# Plot DataCamp's NYC HQ
leaflet() %>% 
    addProviderTiles("CartoDB") %>% 
    addMarkers(lng = dc_hq$lon[1], lat = dc_hq$lat[1])
# Plot DataCamp's NYC HQ with zoom of 12    
leaflet() %>% 
    addProviderTiles("CartoDB") %>% 
    addMarkers(lng = -73.98575, lat = 40.74856)  %>% 
    setView(lng = -73.98575, lat = 40.74856, zoom = 12)    
# Plot both DataCamp's NYC and Belgium locations
leaflet() %>% 
    addProviderTiles("CartoDB") %>% 
    addMarkers(lng = dc_hq$lon, lat = dc_hq$lat)
# Store leaflet hq map in an object called map
map <- leaflet() %>% 
    addProviderTiles("CartoDB") %>% 
    # add hq column of dc_hq as popups
    addMarkers(lng = dc_hq$lon, lat = dc_hq$lat, 
               popup = dc_hq$hq
               )

# Center the view of map on the Belgium HQ with a zoom of 5  
map_zoom <- map %>% 
      setView(lat = 50.881363, lng = 4.717863, zoom = 5)

# Print map_zoom
map_zoom

Chapter 2 - Plotting Points

Introduction to IPEDS Data:

  • Can clear the boundaries of a map, while keeping everything else (data and the like) constant
    • m %>% clearBounds() # kills the bounds layers
    • m %>% clearBounds() %>% clearMarkers() # kills the markers layers
  • The IPEDS data come from the Integrated Postsecondary Education Data System - this course uses a subset consisting of 4-year colleges
    • Goal is to create a subset of the IPEDS data consisting of the ~300 colleges in California
    • Can then plot and color-code the California colleges

Mapping California colleges:

  • Clustered pin markers read poorly because the pins obscure one another
  • A nice alternative is to use circle markers, which have much less tendency to overlap
    • maine_colleges_map %>% clearMarkers() %>% addCircleMarkers(data = maine_colleges, radius = 3)
    • maine_colleges_map %>% addCircleMarkers(data = maine_colleges, radius = 4, color = "red", popup = ~name) # custom color and radius while keeping popups

Labels and pop-ups:

  • Can pipe data into leaflet() and use the tilde (~) to refer to variables in the piped-in data
    • ipeds %>% leaflet() %>% addProviderTiles("CartoDB") %>% addCircleMarkers(lng = ~lng, lat = ~lat, popup = ~name, color = "#FF0000")
    • Colors can be specified in hexadecimal, as shown in the example above - color pickers and sliders make these easy to find
  • Can build richer popups using paste0() and the tilde
    • addCircleMarkers(popup = ~paste0(name, " - ", sector_label))
    • addCircleMarkers(popup = ~paste0("<b>", name, "</b>", "<br/>", sector_label)) # enhanced with HTML tags
  • Labels provide similar information to popups, but require only a hover rather than a click
    • ipeds %>% leaflet() %>% addProviderTiles("CartoDB") %>% addCircleMarkers(label = ~name, radius = 2)

Color coding colleges:

  • Can color markers by a categorical variable that has been piped in, using colorFactor()
    • OR <- ipeds %>% filter(state == "OR")
    • pal <- colorFactor(palette = c("red", "blue", "#9b4a11"), levels = c("Public", "Private", "For-Profit")) # create the color palette for future use
    • oregon_colleges <- OR %>% leaflet() %>% addProviderTiles("CartoDB") %>% addCircleMarkers(radius = 2, color = ~pal(sector_label), label = ~name) # apply as pal()
    • oregon_colleges %>% addLegend(position = "bottomright", pal = pal, values = c("Public", "Private", "For-Profit")) # add a legend
  • Can instead color by a numeric variable using colorNumeric()
    • admit <- admit %>% filter(!is.na(rate), rate < 50, rate > 0) # filter for rates that exist and are between 0 and 50
    • pal <- colorNumeric(palette = "Reds", domain = c(1:50), reverse = TRUE) # reverse = TRUE flips the gradient so that lower admit rates are darker red
    • admit_map <- admit %>% leaflet() %>% addProviderTiles("CartoDB") %>% addCircleMarkers(radius = 4, color = ~pal(rate), label = ~name) %>% addLegend(title = "Admit Rate", pal = pal, values = c(1:50), position = "bottomright")
  • Can use RColorBrewer for default color palettes
    • library(RColorBrewer)
    • display.brewer.all()

Example code includes:

# Remove markers, reset bounds, and store the updated map in the m object
map <- map %>%
    clearMarkers() %>% 
    clearBounds()

# Print the cleared map
map
ipedsRaw <- readr::read_csv("./RInputFiles/ipeds.csv")
## Parsed with column specification:
## cols(
##   name = col_character(),
##   lng = col_double(),
##   lat = col_double(),
##   state = col_character(),
##   sector_label = col_character()
## )
# Remove colleges with missing sector information
ipeds <- 
    ipedsRaw %>% 
    tidyr::drop_na()

# Count the number of four-year colleges in each state
ipeds %>% 
    group_by(state)  %>% 
    count()
## # A tibble: 56 x 2
## # Groups:   state [56]
##    state     n
##    <chr> <int>
##  1 AK        6
##  2 AL       45
##  3 AR       26
##  4 AS        1
##  5 AZ       50
##  6 CA      272
##  7 CO       53
##  8 CT       33
##  9 DC       18
## 10 DE        7
## # ... with 46 more rows
# Create a list of US States in descending order by the number of colleges in each state
ipeds  %>% 
    group_by(state)  %>% 
    count()  %>% 
    arrange(desc(n))
## # A tibble: 56 x 2
## # Groups:   state [56]
##    state     n
##    <chr> <int>
##  1 CA      272
##  2 NY      239
##  3 PA      164
##  4 FL      159
##  5 TX      154
##  6 OH      135
##  7 IL      119
##  8 MA      103
##  9 MO       87
## 10 MN       82
## # ... with 46 more rows
# Create a dataframe called `ca` with data on only colleges in California
ca <- ipeds %>%
    filter(state == "CA")

map <- leaflet() %>% 
    addProviderTiles("CartoDB")

# Use `addMarkers` to plot all of the colleges in `ca` on the `map` leaflet map
map %>%
    addMarkers(lng = ca$lng, lat = ca$lat)
la_coords <- data.frame(lat = 34.05223, lon = -118.2437) 

# Center the map on LA 
map %>% 
    addMarkers(data = ca) %>% 
    setView(lat = la_coords$lat, lng = la_coords$lon, zoom = 12)
## Assuming "lng" and "lat" are longitude and latitude, respectively
# Set the zoom level to 8 and store in the m object
map_zoom <-
    map %>%
    addMarkers(data = ca) %>%
    setView(lat = la_coords$lat, lng = la_coords$lon, zoom = 8)
## Assuming "lng" and "lat" are longitude and latitude, respectively
map_zoom
# Clear the markers from the map 
map2 <- map %>% clearMarkers()

# Use addCircleMarkers() to plot each college as a circle
map2 %>%
    addCircleMarkers(lng = ca$lng, lat = ca$lat)
# Change the radius of each circle to be 2 pixels and the color to red
map2 %>% 
    addCircleMarkers(lng = ca$lng, lat = ca$lat, radius = 2, color = "red")
# Add circle markers with popups for college names
map %>%
    addCircleMarkers(data = ca, radius = 2, popup = ~name)
## Assuming "lng" and "lat" are longitude and latitude, respectively
# Change circle color to #2cb42c and store map in map_color object
map_color <- map %>% 
    addCircleMarkers(data = ca, radius = 2, color = "#2cb42c", popup = ~name)
## Assuming "lng" and "lat" are longitude and latitude, respectively
# Print map_color
map_color
# Clear the bounds and markers on the map object and store in map2
map2 <- map %>% 
    clearBounds() %>% 
    clearMarkers()

# Add circle markers with popups that display both the institution name and sector
map2 %>% 
    addCircleMarkers(data = ca, radius = 2, 
                     popup = ~paste0(name, "<br/>", sector_label)
                     )
## Assuming "lng" and "lat" are longitude and latitude, respectively
# Make the institution name in each popup bold
map2 %>% 
    addCircleMarkers(data = ca, radius = 2, 
                     popup = ~paste0("<b>", name, "</b>", "<br/>", sector_label)
                     )
## Assuming "lng" and "lat" are longitude and latitude, respectively
# Add circle markers with labels identifying the name of each college
map %>% 
    addCircleMarkers(data = ca, radius = 2, label = ~name)
## Assuming "lng" and "lat" are longitude and latitude, respectively
# Use paste0 to add sector information to the label inside parentheses 
map %>% 
    addCircleMarkers(data = ca, radius = 2, label = ~paste0(name, " (", sector_label, ")"))
## Assuming "lng" and "lat" are longitude and latitude, respectively
# Make a color palette called pal for the values of `sector_label` using `colorFactor()`  
# Colors should be: "red", "blue", and "#9b4a11" for "Public", "Private", and "For-Profit" colleges, respectively
pal <- colorFactor(palette = c("red", "blue", "#9b4a11"), 
                   levels = c("Public", "Private", "For-Profit")
                   )

# Add circle markers that color colleges using pal() and the values of sector_label
map2 <- map %>% 
        addCircleMarkers(data = ca, radius = 2, 
                         color = ~pal(sector_label), 
                         label = ~paste0(name, " (", sector_label, ")")
                         )
## Assuming "lng" and "lat" are longitude and latitude, respectively
# Print map2
map2
# Add a legend that displays the colors used in pal
map2 %>% 
    addLegend(pal = pal, values = c("Public", "Private", "For-Profit"))
# Customize the legend
map2 %>% 
    addLegend(pal = pal, 
              values = c("Public", "Private", "For-Profit"),
              # opacity of .5, title of Sector, and position of topright
              opacity = 0.5, title = "Sector", position = "topright"
              )

Chapter 3 - Groups, Layers, Extras

Leaflet Extras Package:

  • The leaflet.extras package provides some nice extensibility to the baseline leaflet package
    • leaflet() %>% addTiles() %>% addSearchOSM() # searching open-source-maps (magnifying glass icon with search box)
    • leaflet() %>% addTiles() %>% addSearchOSM() %>% addReverseSearchOSM() # can also use geocode to find a click, as requested by addReverseSearchOSM()
    • leaflet() %>% addTiles() %>% addSearchOSM() %>% addReverseSearchOSM() %>% addResetMapButton() # can click "reset" to return to the default view

Overlay Groups - ability to control the segments that are displayed on the map:

  • One option is to segment the data in advance, then add each segment as a layer using addCircleMarkers()
    • ca_public <- ipeds %>% filter(sector_label == "Public", state == "CA")
    • m %>% addCircleMarkers(data = ca_public, group = "Public")
  • After creating multiple addCircleMarkers() calls, each with group =, can then activate the grouping
    • addLayersControl(overlayGroups = c("Public", "Private", "For-Profit"))
  • Since the layers are stacked, the order in which they are added matters (they layer/stack on top of each other)

Base Groups - can provide multiple options for toggling (only one may be selected at a time):

  • Need to call addProviderTiles() once for each layer that is an option, then activate using addLayersControl()
    • a <- leaflet() %>% addTiles(group = "OSM") %>% addProviderTiles("CartoDB", group = "Carto") %>% addProviderTiles("Esri", group = "Esri")
    • a %>% addLayersControl(baseGroups = c("OSM", "Carto", "Esri"), position = "topleft")
  • Can be handy to try a few different base groups during exploratory analysis, to find the base that best matches the rest of the analysis
  • Basic four-step process for building up the base groups includes
    • leaflet() %>% # initialize leaflet map
    • addTiles(group = "OSM") %>% addProviderTiles("CartoDB", group = "Carto") %>% addProviderTiles("Esri", group = "Esri") %>% # add basemaps with groups
    • addCircleMarkers(data = public, radius = 2, label = ~htmlEscape(name), color = ~pal(sector_label), group = "Public") %>% addCircleMarkers(data = private, radius = 2, label = ~htmlEscape(name), color = ~pal(sector_label), group = "Private") %>% addCircleMarkers(data = profit, radius = 2, label = ~htmlEscape(name), color = ~pal(sector_label), group = "For-Profit") %>% # add a marker layer for each sector with a corresponding group name
    • addLayersControl(baseGroups = c("OSM", "Carto", "Esri"), overlayGroups = c("Public", "Private", "For-Profit")) # add layer controls for base and overlay groups

Pieces of Flair:

  • Can customize the search function using leaflet.extras capabilities
    • ca_public <- ipeds %>% filter(sector_label == "Public", state == "CA")
    • ca_public %>% leaflet() %>% addProviderTiles("Esri") %>% addCircleMarkers(radius = 2, label = ~htmlEscape(name), color = ~pal(sector_label), group = "Public") %>% addSearchFeatures(targetGroups = "Public", options = searchFeaturesOptions(zoom = 10)) # will restrict the search to the Public group, with a specified zoom
  • Can cluster the colleges to improve readability of the maps
    • ipeds %>% leaflet() %>% addTiles() %>% addCircleMarkers(radius = 2, color = ~pal(sector_label), clusterOptions = markerClusterOptions()) # many colleges in one circle

Example code includes:

library(leaflet.extras)
library(htmltools)

leaflet() %>%
  addTiles() %>% 
  addSearchOSM() %>% 
  addReverseSearchOSM() 
m2 <- ipeds %>% 
    leaflet() %>% 
    # use the CartoDB provider tile
    addProviderTiles("CartoDB") %>% 
    # center on the middle of the US with zoom of 3
    setView(lat = 39.8282, lng = -98.5795, zoom=3)

# Map all American colleges 
m2 %>% 
    addCircleMarkers() 
## Assuming "lng" and "lat" are longitude and latitude, respectively
# Create data frame called public with only public colleges
public <- filter(ipeds, sector_label == "Public")  

# Create a leaflet map of public colleges called m3 
m3 <- leaflet() %>% 
    addProviderTiles("CartoDB") %>% 
    addCircleMarkers(data = public, radius = 2, label = ~htmlEscape(name), 
                     color = ~pal(sector_label), group = "Public"
                     )
## Assuming "lng" and "lat" are longitude and latitude, respectively
m3
# Create data frame called private with only private colleges
private <- filter(ipeds, sector_label == "Private")  

# Add private colleges to `m3` as a new layer
m3 <- m3 %>% 
    addCircleMarkers(data = private, radius = 2, label = ~htmlEscape(name), 
                     color = ~pal(sector_label), group = "Private"
                     ) %>% 
    addLayersControl(overlayGroups = c("Public", "Private"))
## Assuming "lng" and "lat" are longitude and latitude, respectively
m3
# Create data frame called profit with only for-profit colleges
profit <- filter(ipeds, sector_label == "For-Profit")  

# Add for-profit colleges to `m3` as a new layer
m3 <- m3 %>% 
    addCircleMarkers(data = profit, radius = 2, label = ~htmlEscape(name), 
                     color = ~pal(sector_label),   group = "For-Profit"
                     )  %>% 
    addLayersControl(overlayGroups = c("Public", "Private", "For-Profit"))  
## Assuming "lng" and "lat" are longitude and latitude, respectively
# Center the map on the middle of the US with a zoom of 4
m4 <- m3 %>%
    setView(lat = 39.8282, lng = -98.5795, zoom = 4) 
        
m4
leaflet() %>% 
  # Add the OSM, CartoDB and Esri tiles
  addTiles(group = "OSM") %>% 
  addProviderTiles("CartoDB", group = "Carto") %>% 
  addProviderTiles("Esri", group = "Esri") %>% 
  # Use addLayersControl to allow users to toggle between basemaps
  addLayersControl(baseGroups = c("OSM", "Carto", "Esri"))
m4 <- leaflet() %>% 
    addTiles(group = "OSM") %>% 
    addProviderTiles("CartoDB", group = "Carto") %>% 
    addProviderTiles("Esri", group = "Esri") %>% 
    addCircleMarkers(data = public, radius = 2, label = ~htmlEscape(name), 
                     color = ~pal(sector_label),  group = "Public"
                     ) %>% 
    addCircleMarkers(data = private, radius = 2, label = ~htmlEscape(name), 
                     color = ~pal(sector_label), group = "Private"
                     )  %>% 
    addCircleMarkers(data = profit, radius = 2, label = ~htmlEscape(name), 
                     color = ~pal(sector_label), group = "For-Profit"
                     )  %>% 
    addLayersControl(baseGroups = c("OSM", "Carto", "Esri"), 
                     overlayGroups = c("Public", "Private", "For-Profit")
                     ) %>% 
    setView(lat = 39.8282, lng = -98.5795, zoom = 4) 
## Assuming "lng" and "lat" are longitude and latitude, respectively
## Assuming "lng" and "lat" are longitude and latitude, respectively
## Assuming "lng" and "lat" are longitude and latitude, respectively
m4
ipeds %>% 
    leaflet() %>% 
    addTiles() %>% 
    # Sanitize any html in our labels
    addCircleMarkers(radius = 2, label = ~htmlEscape(name), 
                     # Color code colleges by sector using the `pal` color palette 
                     color = ~pal(sector_label), 
                     # Cluster all colleges using `clusterOptions` 
                     clusterOptions = markerClusterOptions()
                     ) 
## Assuming "lng" and "lat" are longitude and latitude, respectively

Chapter 4 - Plotting Polygons

Spatial Data - ability to plot polygons rather than points:

  • Polygons have many points, and are stored in a SpatialPolygonsDataFrame (SPDF) with 5 slots
    • data - one observation per polygon
    • polygons - coordinates to plot each polygon
    • plotOrder - order for plotting
    • bbox - rectangle containing all the polygons
    • proj4string - coordinate reference system (CRS)
    • All accessed using the @ symbol
  • Can join attribute data onto the data component of the SPDF, accessed using @
    • shp@data <- shp@data %>% left_join(nc_income, by = c("GEOID10" = "zipcode"))
    • shp@polygons[[1]] %>% leaflet() %>% addPolygons() # can plot a single polygon

Mapping Polygons - can pipe SPDF in to a series of leaflet calls:

  • The basic polygon plotting method using leaflet() may produce shape boundaries that are too thick
    • shp %>% leaflet() %>% addTiles() %>% addPolygons()
    • weight - thickness of lines
    • color - color of lines
    • label - information shown on hover
    • highlight - options to highlight polygon on hover
  • The refined plotting approach adds customization for better readability
    • shp %>% leaflet() %>% addTiles() %>% addPolygons(weight = 1, color = “grey”, label = ~paste0(“Total Income:” dollar(income)), highlight = highlightOptions(weight = 3, color = “red”, bringToFront = TRUE))
  • Can color numeric data when plotting polygons
    • colorNumeric - maps continuous data to interpolated palettes
    • colorBin - colors based on cut function
    • colorQuantile - colors based on quantile
    • nc_pal <- colorNumeric(palette = “Blues”, domain = high_inc@data$mean_income)
    • nc_pal <- colorBin(palette = “YlGn”, bins = 5, domain = high_inc@data$mean_income)
    • nc_pal <- colorQuantile(palette = “YlGn”, n = 4, domain = high_inc@data$mean_income)
  • Example of coloring using colorNumeric()
    • nc_pal <- colorNumeric("Blues", domain = high_inc@data$mean_income)
    • previewColors(pal = nc_pal, values = seq(100000, 600000, by = 100000)) # explore sample values
    • shp %>% leaflet() %>% addTiles() %>% addPolygons(weight = 1, fillOpacity = 1, color = ~nc_pal(mean_income), label = ~paste0("Mean Income: ", dollar(mean_income)), highlight = highlightOptions(weight = 3, color = "red", bringToFront = TRUE))
  • Sometimes need to log-transform skewed data for better displays

Putting Everything Together:

  • Leaflet and htmlwidgets for base maps and coloring
  • Base and overlay groups to enhance interactivity
  • Additional features are available in the leaflet.extras package
  • Can piece together a full map that includes both polygons and circle markers
    • leaflet() %>% addTiles(group = "OSM") %>% addProviderTiles("CartoDB", group = "Carto") %>%
    • addProviderTiles("Esri", group = "Esri") %>%
    • addPolygons(data = shp, weight = 1, fillOpacity = .75, color = ~nc_pal(log(mean_income)), label = ~paste0("Mean Income: ", dollar(mean_income)), group = "Mean Income") %>%
    • addCircleMarkers(data = nc_public, radius = 2, label = ~htmlEscape(name), color = ~pal(sector_label), group = "Public") %>%
    • addCircleMarkers(data = nc_private, radius = 2, label = ~htmlEscape(name), color = ~pal(sector_label), group = "Private") %>%
    • addCircleMarkers(data = nc_profit, radius = 2, label = ~htmlEscape(name), color = ~pal(sector_label), group = "For-Profit") %>%
    • addLayersControl(baseGroups = c("OSM", "Carto", "Esri"), overlayGroups = c("Public", "Private", "For-Profit", "Mean Income"))
  • Can also save a map for future use
    • m <- leaflet() %>% addTiles() %>% addMarkers(data = ipeds, clusterOptions = markerClusterOptions()) %>% addPolygons(data = shp)
    • library(htmlwidgets)
    • saveWidget(m, file = "myMap.html") # saves the file as html

Wrap up - additional resources:

Example code includes:

load("./RInputFiles/nc_zips.Rda")
load("./RInputFiles/wealthiest_zips.Rda")
nc_income <- readr::read_csv("./RInputFiles/mean_income_by_zip_nc.csv")
## Parsed with column specification:
## cols(
##   zipcode = col_integer(),
##   returns = col_integer(),
##   income = col_double(),
##   mean_income = col_double()
## )
str(nc_income, give.attr = FALSE)
## Classes 'tbl_df', 'tbl' and 'data.frame':    723 obs. of  4 variables:
##  $ zipcode    : int  28207 28211 27608 28480 27517 27614 28173 28036 27408 28226 ...
##  $ returns    : int  4470 14060 5690 1510 12710 15670 21880 7640 9100 19240 ...
##  $ income     : num  2.46e+09 3.32e+09 1.13e+09 2.41e+08 1.97e+09 ...
##  $ mean_income: num  550849 235961 197725 159617 154682 ...
# Print a summary of the `shp` data
summary(shp)
## Loading required package: sp
## Object of class SpatialPolygonsDataFrame
## Coordinates:
##         min       max
## x -84.32187 -75.46089
## y  33.84232  36.58812
## Is projected: FALSE 
## proj4string :
## [+init=epsg:4326 +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84
## +towgs84=0,0,0]
## Data attributes:
##     GEOID10         ALAND10   
##  27006  :  1   100240769:  1  
##  27007  :  1   100252722:  1  
##  27009  :  1   1003885  :  1  
##  27011  :  1   100620829:  1  
##  27012  :  1   100707703:  1  
##  27013  :  1   101001856:  1  
##  (Other):802   (Other)  :802
# Print the class of `shp`
class(shp)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
# Print the slot names of `shp`
slotNames(shp)
## [1] "data"        "polygons"    "plotOrder"   "bbox"        "proj4string"
# Glimpse the data slot of shp
glimpse(shp@data)
## Observations: 808
## Variables: 2
## $ GEOID10 <fct> 27925, 28754, 28092, 27217, 28711, 28666, 28602, 27841...
## $ ALAND10 <fct> 624688620, 223734670, 317180853, 318965510, 258603117,...
# Print the class of the data slot of shp
class(shp@data)
## [1] "data.frame"
# Print GEOID10
shp@data$GEOID10
##   [1] 27925 28754 28092 27217 28711 28666 28602 27841 27831 28785 27504
##  [12] 27330 28768 28658 28716 28139 27565 28394 27982 28025 28159 28382
##  [23] 28312 28342 27839 27852 28723 28077 28039 28452 27306 28375 28713
##  [34] 28743 28717 28150 28447 27205 27379 28425 27827 27540 28114 28451
##  [45] 27892 27249 28628 27873 28781 27916 28705 28714 28101 28102 28445
##  [56] 28448 28458 28719 28478 28479 28501 28748 28752 28207 28753 28757
##  [67] 28209 28212 28560 28504 27983 27985 28018 28019 28562 28906 28530
##  [78] 28771 28779 28782 28376 28581 28152 28169 28170 28657 28021 28204
##  [89] 28533 28540 28543 28551 28262 28280 28575 28790 28792 28667 28672
## [100] 28108 28462 28681 28465 28734 28739 28694 28697 28702 28745 28127
## [111] 28420 28422 28424 28428 28435 28088 28089 28090 27562 28334 28787
## [122] 28433 27360 27534 28043 27370 28444 27531 28675 28712 28449 27053
## [133] 27944 28367 28326 28740 28659 28282 27244 27597 27017 28761 28457
## [144] 28441 27956 27889 28652 28146 28513 28777 28786 27596 27530 28369
## [155] 28327 27340 27028 27823 27879 28244 27810 27886 28306 27025 27239
## [166] 27967 27824 27826 27834 27030 28358 28365 27520 27524 27525 27526
## [177] 27292 27874 27882 27883 27885 27253 27576 27577 27582 27295 27298
## [188] 27332 27910 27052 27055 27344 27516 27850 27856 27265 27603 27605
## [199] 27537 27539 27541 28601 28604 27809 27278 27284 27371 27201 27312
## [210] 28320 28325 27207 28330 28607 28611 28612 27549 27555 27317 27320
## [221] 27703 27709 28350 28643 28337 28621 27569 28645 28651 27948 28630
## [232] 27923 27929 27936 27943 28721 28512 27546 27891 28379 27822 27909
## [243] 28655 28662 27587 27589 28625 28742 28553 27941 28134 27043 27893
## [254] 28328 28135 28007 28338 27110 28472 28756 28110 28519 27861 27407
## [265] 28374 28211 28668 27214 27965 27949 27806 28340 27917 27288 27563
## [276] 28669 27229 27283 27109 27843 27047 28303 28585 28676 28689 28305
## [287] 28635 28640 27016 27863 27968 28528 27915 27981 28411 28577 27326
## [298] 27954 28556 27105 27545 27813 27974 27301 28168 28670 28801 27050
## [309] 28610 28665 28125 28538 27849 28036 28586 27801 27807 28904 27875
## [320] 28557 27958 28468 27536 28213 28341 28747 28707 27262 28006 28360
## [331] 28031 27845 28166 28616 27572 27014 27503 27011 28572 28386 27291
## [342] 28432 27804 27343 28073 28467 28173 28539 28352 27828 28515 28555
## [353] 27855 27583 28310 28396 28348 28138 28642 27542 27408 28215 27821
## [364] 28105 28270 28206 28301 27876 28627 27019 28574 28647 28806 27349
## [375] 28091 28660 28726 28508 27840 28803 28511 27964 27978 28086 27927
## [386] 28774 28383 27559 28523 28332 28749 27962 27455 28056 27501 28027
## [397] 27527 27282 27837 28682 27310 28356 27233 27231 27006 28144 27857
## [408] 27042 28314 27612 28525 27281 28147 28366 28629 27523 27937 28119
## [419] 28012 27048 27880 27350 27027 27606 27938 28638 28720 28580 27103
## [430] 27986 28001 28034 28393 28032 28040 28677 28395 28391 28678 28399
## [441] 28455 28098 28401 28103 28684 28685 28409 28071 28683 28083 28708
## [452] 28097 28450 28431 28453 28454 28709 28439 28377 28715 28443 28436
## [463] 28438 28751 28129 28133 28763 28109 28120 28466 28746 28137 28480
## [474] 28759 28731 28762 28405 28054 28698 28081 28403 28052 28701 28690
## [485] 28412 28704 28078 28421 28693 28544 28516 28773 28775 28905 28174
## [496] 28203 28570 28208 28210 28202 28804 28805 28791 28901 28547 28107
## [507] 28722 28729 28461 28730 28463 28552 28554 28115 28732 28112 28214
## [518] 28733 28308 28304 28571 28584 28582 28583 28273 28587 28278 28578
## [529] 28579 28323 28164 28605 28518 28520 28526 28783 28529 28167 28521
## [540] 28531 28311 28163 28537 28772 28626 27942 27928 28634 28649 28339
## [551] 28357 27935 28623 28618 28654 28624 28619 27922 28307 28226 27946
## [562] 27947 28347 28349 28227 28637 27926 27920 28646 28573 27921 28351
## [573] 28269 28590 27341 28364 27604 27976 28615 27357 28344 28613 28609
## [584] 28343 27409 27376 27377 27701 27610 27979 27405 27704 27705 27959
## [595] 27960 27403 27966 27953 27970 27972 27973 27707 27957 27401 27517
## [606] 27502 27507 27508 27509 27510 27518 27505 27020 27613 27024 27514
## [617] 27519 27713 27614 27803 27616 27617 27513 27511 27023 27046 27844
## [628] 27869 27853 27051 27041 27521 27871 27872 27842 27106 27830 27846
## [639] 27013 27862 27104 27832 27847 27858 27865 27851 27825 27829 27012
## [650] 27816 27817 27557 27808 27209 27208 27820 27888 27814 27551 27556
## [661] 27045 27235 27560 27215 27054 27248 27242 27260 27243 27258 27581
## [672] 27812 27601 27592 27591 27544 27316 27313 27325 27314 27311 27896
## [683] 27007 28650 28606 27009 28735 28673 28725 28033 27870 27864 28429
## [694] 28384 28663 27022 28333 27574 28524 28527 28277 27263 28023 27573
## [705] 27615 28020 28464 28128 28009 28205 28104 27299 27884 28076 28080
## [716] 28160 28532 27302 28124 27932 27924 28037 27819 27608 28789 28079
## [727] 28398 27553 27878 27018 27040 28392 27315 28594 27950 28442 27410
## [738] 27805 28371 27305 28778 28692 28072 28456 28589 28363 27355 27358
## [749] 28385 28736 27890 27522 28617 28671 28387 28390 27212 27609 27568
## [760] 28679 27881 27101 28622 28644 28631 28636 28373 28345 27712 28117
## [771] 27866 27021 27406 28741 28372 27897 28430 27980 28017 27203 28909
## [782] 27127 27607 27939 28217 28216 27252 28423 28718 27919 28510 28460
## [793] 28434 28470 28766 28546 27818 27529 28469 28016 28075 28318 27107
## [804] 27356 28315 27571 27860 28902
## 33144 Levels: 00601 00602 00603 00606 00610 00612 00616 00617 00622 ... 99929
shp@data$GEOID10 <- as.integer(as.character(shp@data$GEOID10))
str(shp@data$GEOID10)
##  int [1:808] 27925 28754 28092 27217 28711 28666 28602 27841 27831 28785 ...
# Glimpse the nc_income data
glimpse(nc_income)
## Observations: 723
## Variables: 4
## $ zipcode     <int> 28207, 28211, 27608, 28480, 27517, 27614, 28173, 2...
## $ returns     <int> 4470, 14060, 5690, 1510, 12710, 15670, 21880, 7640...
## $ income      <dbl> 2462295000, 3317607000, 1125055000, 241022000, 196...
## $ mean_income <dbl> 550849.0, 235960.7, 197725.0, 159617.2, 154682.2, ...
# Summarise the nc_income data
summary(nc_income)
##     zipcode         returns          income           mean_income    
##  Min.   :27006   Min.   :  110   Min.   :4.557e+06   Min.   : 26625  
##  1st Qu.:27605   1st Qu.: 1105   1st Qu.:4.615e+07   1st Qu.: 40368  
##  Median :28115   Median : 3050   Median :1.526e+08   Median : 46288  
##  Mean   :28062   Mean   : 5979   Mean   :3.648e+08   Mean   : 53338  
##  3rd Qu.:28521   3rd Qu.: 9050   3rd Qu.:4.670e+08   3rd Qu.: 55917  
##  Max.   :28909   Max.   :37020   Max.   :3.970e+09   Max.   :550849
# Left join nc_income onto shp@data and store in shp_nc_income
shp_nc_income <- shp@data %>% 
                left_join(nc_income, by = c("GEOID10" = "zipcode"))

# Print the number of missing values of each variable in shp_nc_income
shp_nc_income %>%
  summarise_all(funs(sum(is.na(.))))
##   GEOID10 ALAND10 returns income mean_income
## 1       0       0      85     85          85
shp <- merge(shp, shp_nc_income, by=c("GEOID10", "ALAND10"))


# map the polygons in shp
shp %>% 
    leaflet() %>% 
    addTiles() %>% 
    addPolygons()
# which zips were not in the income data?
shp_na <- shp[is.na(shp$mean_income),]

# map the polygons in shp_na
shp_na %>% 
    leaflet() %>% 
    addTiles() %>% 
    addPolygons()
# summarise the mean income variable
summary(shp$mean_income)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   26620   40370   46290   53340   55920  550800      85
# subset shp to include only zip codes in the top quartile of mean income
high_inc <- shp[!is.na(shp$mean_income) & shp$mean_income > 55917,]

# map the boundaries of the zip codes in the top quartile of mean income
high_inc %>%
  leaflet() %>%
  addTiles() %>%
  addPolygons()
dollar <- function (x, negative_parens=TRUE, prefix="$", suffix="") {
    # KLUGE to make this work . . . 
    needs_cents <- function(...) { FALSE }
    if (length(x) == 0) 
        return(character())
    x <- plyr::round_any(x, 0.01)
    if (needs_cents(x, largest_with_cents)) {
        nsmall <- 2L
    }
    else {
        x <- plyr::round_any(x, 1)
        nsmall <- 0L
    }
    negative <- !is.na(x) & x < 0
    if (negative_parens) {
        x <- abs(x)
    }
    amount <- format(abs(x), nsmall = nsmall, trim = TRUE, big.mark = ",", scientific = FALSE, digits = 1L)
    if (negative_parens) {
        paste0(ifelse(negative, "(", ""), prefix, amount, suffix, ifelse(negative, ")", ""))
    }
    else {
        paste0(prefix, ifelse(negative, "-", ""), amount, suffix)
    }
}


# create color palette with colorNumeric()
nc_pal <- colorNumeric("YlGn", domain = high_inc@data$mean_income)

high_inc %>%
  leaflet() %>%
  addTiles() %>%
  # set boundary thickness to 1 and color polygons blue
  addPolygons(weight = 1, color = ~nc_pal(mean_income),
              # add labels that display mean income
              label = ~paste0("Mean Income: ", dollar(mean_income)),
              # highlight polygons on hover
              highlight = highlightOptions(weight = 5, color = "white",
              bringToFront = TRUE))
# Create a logged version of the nc_pal color palette
nc_pal <- colorNumeric("YlGn", domain = log(high_inc@data$mean_income))

# apply the nc_pal
high_inc %>%
  leaflet() %>%
  addProviderTiles("CartoDB") %>%
  addPolygons(weight = 1, color = ~nc_pal(log(mean_income)), fillOpacity = 1,
              label = ~paste0("Mean Income: ", dollar(mean_income)),
              highlightOptions = highlightOptions(weight = 5, color = "white", bringToFront = TRUE))
# Print the slot names of `wealthy_zips`
slotNames(wealthy_zips)
## [1] "data"        "polygons"    "plotOrder"   "bbox"        "proj4string"
# Print a summary of the `mean_income` variable
summary(wealthy_zips$mean_income)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  200400  229900  279300  339900  371900 2554000
# plot zip codes with mean incomes >= $200k
wealthy_zips %>% 
  leaflet() %>% 
  addProviderTiles("CartoDB") %>% 
  addPolygons(weight = 1, fillOpacity = .7, color = "Green",  group = "Wealthy Zipcodes", 
              label = ~paste0("Mean Income: ", dollar(mean_income)),
              highlightOptions = highlightOptions(weight = 5, color = "white", bringToFront = TRUE))
# Add polygons using wealthy_zips
final_map <- m4 %>% 
   addPolygons(data = wealthy_zips, weight = 1, fillOpacity = .5, color = "Grey",  group = "Wealthy Zip Codes", 
              label = ~paste0("Mean Income: ", dollar(mean_income)),
              highlight = highlightOptions(weight = 5, color = "white", bringToFront = TRUE)) %>% 
    # Update layer controls including "Wealthy Zip Codes"
    addLayersControl(baseGroups = c("OSM", "Carto", "Esri"), 
                         overlayGroups = c("Public", "Private", "For-Profit", "Wealthy Zip Codes"))

# Print and explore your very last map of the course!
final_map

Support Vector Machines in R

Chapter 1 - Introduction

Sugar content of soft drinks:

  • Course covers Support Vector Machines (SVM), including visualization, mechanics, situations where they work best, etc.
    • Will stick with binary classification for this course
  • For a 1-dimensional dataset, the clusters can be separated by choosing a “separating boundary” (decision boundary)
  • Margins are the distances between the decision boundary and the closest points in each class
    • The best decision boundary is the one that maximizes the margin (more robust to noise)
    • The SVM tries to find the decision boundary that maximizes the margin in n-dimensions

Generating a linearly separable dataset

  • Can use runif() to generate random data uniformly distributed between 0 and 1
    • n <- 200
    • set.seed(42)
    • df <- data.frame(x1 = runif(n), x2 = runif(n))
  • Can define the points with x1 < x2 as class A and the points with x1 > x2 as class B
    • Can also create a margin by filtering out points where abs(x1-x2) is below a user-specified threshold

Example code includes:

df <- data.frame(sample=1:25, 
                 sugar_content=c(10.9, 10.9, 10.6, 10, 8, 8.2, 8.6, 10.9, 10.7, 8, 7.7, 7.8, 8.4, 11.5, 11.2, 8.9, 8.7, 7.4, 10.9, 10, 11.4, 10.8, 8.5, 8.2, 10.6)
                 )
str(df)
## 'data.frame':    25 obs. of  2 variables:
##  $ sample       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ sugar_content: num  10.9 10.9 10.6 10 8 8.2 8.6 10.9 10.7 8 ...
#print variable names
names(df)
## [1] "sample"        "sugar_content"
#build plot
plot_ <- ggplot(data = df, aes(x = sugar_content, y = c(0))) + 
    geom_point() + 
    geom_text(label = df$sugar_content, size = 2.5, vjust = 2, hjust = 0.5)

#display plot
plot_

#The maximal margin separator is at the midpoint of the two extreme points in each cluster.
mm_separator <- (8.9 + 10)/2


#create data frame
separator <- data.frame(sep = c(mm_separator))

#add ggplot layer 
plot_ <- plot_ + 
  geom_point(data = separator, x = separator$sep, y = c(0), color = "blue", size = 4)

#display plot
plot_

#set seed
set.seed(42)

#set number of data points. 
n <- 600

#Generate data frame with two uniformly distributed predictors lying between 0 and 1.
df <- data.frame(x1 = runif(n), x2 = runif(n))

#classify data points depending on location
df$y <- factor(ifelse(df$x2 - 1.4*df$x1 < 0, -1, 1), levels = c(-1, 1))


#set margin
delta <- 0.07

# retain only those points that lie outside the margin
df1 <- df[abs(1.4*df$x1 - df$x2) > delta, ]

#build plot
plot_margins <- ggplot(data = df1, aes(x = x1, y = x2, color = y)) + geom_point() + 
    scale_color_manual(values = c("red", "blue")) + 
    geom_abline(slope = 1.4, intercept = 0)+
    geom_abline(slope = 1.4, intercept = delta, linetype = "dashed") +
    geom_abline(slope = 1.4, intercept = -delta, linetype = "dashed")
 
#display plot 
plot_margins


Chapter 2 - Support Vector Classifiers - Linear Kernels

Linear Support Vector Machines:

  • Can split the data from the previous chapter (perfectly separable) into train/test on an 80/20 basis
    • set.seed(1)
    • df[, "train"] <- ifelse(runif(nrow(df)) < 0.8, 1, 0)
    • trainset <- df[df$train == 1, ]
    • testset <- df[df$train == 0, ]
    • trainColNum <- grep("train", names(trainset))
    • trainset <- trainset[, -trainColNum]
    • testset <- testset[, -trainColNum]
  • Decision boundaries come in many shapes (determined by the kernel), such as lines, polynomials, etc.
  • For this chapter, will use e1071::svm(), a function with many options
    • formula, data, type ("C-classification" for classification), kernel ("linear" for this chapter), cost/gamma (tuning parameters, left at their defaults for now), and scale (whether to scale the data in advance - FALSE makes for easier plotting, but TRUE is typical in real-world use)
  • Example of running e1071::svm()
    • library(e1071)
    • svm_model <- svm(y ~ ., data = trainset, type = "C-classification", kernel = "linear", scale = FALSE)
    • svm_model
    • svm_model$index # indices of the support vectors
    • svm_model$SV # support vector coordinates
    • svm_model$rho # negative y-intercept of the decision boundary
    • svm_model$coefs # weighting coefficients of the support vectors (magnitude gives importance; sign gives the side of the boundary)
    • pred_train <- predict(svm_model, trainset)
    • pred_test <- predict(svm_model, testset)

Visualizing Linear SVM:

  • Can begin by plotting the training data, distinguished by color
    • p <- ggplot(data = trainset, aes(x = x1, y = x2, color = y)) + geom_point() + scale_color_manual(values = c("red", "blue"))
    • df_sv <- trainset[svm_model$index, ]
    • p <- p + geom_point(data = df_sv, aes(x = x1, y = x2), color = "purple", size = 4, alpha = 0.5)
    • p
  • The support vectors tend to be close to the decision boundary - in fact, they are defined as the points that "support" the boundary
  • Goal is to extract the slope and intercept of the decision boundary (not stored directly in the model object)
    • w <- t(svm_model$coefs) %*% svm_model$SV
    • slope_1 <- -w[1]/w[2]
    • intercept_1 <- svm_model$rho/w[2]
    • p <- p + geom_abline(slope = slope_1, intercept = intercept_1)
    • p <- p + geom_abline(slope = slope_1, intercept = intercept_1 - 1/w[2], linetype = "dashed") + geom_abline(slope = slope_1, intercept = intercept_1 + 1/w[2], linetype = "dashed")
    • p
  • There are several properties observed in the plot
    • The boundary is supported by the support vectors
    • The boundary is "soft", which allows for uncertainty in the location/shape of the boundary
    • Can also use plot(x = myModel, data = myData) (e1071's plot method for svm objects) to plot the model

Tuning Linear SVM:

  • Can tweak the cost parameter to change the size of the soft boundary for the SVM
    • Higher costs lead to harder (narrower) margins, with fewer support vectors
    • The implication is that raising the cost can be a good idea if the data are known to be linearly separable

Multi-class problems:

  • SVM can manage classification problems with 3+ target types also - using the example iris data
    • p <- ggplot(data = iris, aes(x = Petal.Width, y = Petal.Length, color = Species)) + geom_point()
    • p
  • The SVM at core is a binary classifier, but can be used in a multi-class setting
    • Fit a model for each of the choose(m, 2) possible pairs of classes, and use majority voting on the outputs (ties broken at random)
    • This method is called "one-against-one" classification, and it is handled automatically by e1071
    • svm_model <- svm(Species ~ ., data = trainset, type = "C-classification", kernel = "linear") # one-against-one runs automatically

Example code includes:

dfOld <- df
delta <- 0.07
df <- df[abs(1.4*df$x1 - df$x2) > delta, ]


#split train and test data in an 80/20 proportion
df[, "train"] <- ifelse(runif(nrow(df))<0.8, 1, 0)

#assign training rows to data frame trainset
trainset <- df[df$train == 1, ]
#assign test rows to data frame testset
testset <- df[df$train == 0, ]

#find index of "train" column
trainColNum <- grep("train", names(df))

#remove "train" column from train and test dataset
trainset <- trainset[, -trainColNum]
testset <- testset[, -trainColNum]


library(e1071)

#build svm model, setting required parameters
svm_model<- svm(y ~ ., 
                data = trainset, 
                type = "C-classification", 
                kernel = "linear", 
                scale = FALSE)


#list components of model
names(svm_model)
##  [1] "call"            "type"            "kernel"         
##  [4] "cost"            "degree"          "gamma"          
##  [7] "coef0"           "nu"              "epsilon"        
## [10] "sparse"          "scaled"          "x.scale"        
## [13] "y.scale"         "nclasses"        "levels"         
## [16] "tot.nSV"         "nSV"             "labels"         
## [19] "SV"              "index"           "rho"            
## [22] "compprob"        "probA"           "probB"          
## [25] "sigma"           "coefs"           "na.action"      
## [28] "fitted"          "decision.values" "terms"
#list values of the SV, index and rho
svm_model$SV
##               x1          x2
## 11  0.4577417762 0.476919189
## 19  0.4749970816 0.486642912
## 45  0.4317512489 0.520339758
## 58  0.1712643304 0.100229354
## 61  0.6756072745 0.772399305
## 69  0.6932048204 0.838569788
## 99  0.7439746463 0.912029979
## 101 0.6262453445 0.765520479
## 103 0.2165673110 0.202548483
## 118 0.3556659538 0.298152283
## 143 0.4640695513 0.535269056
## 144 0.7793681615 0.941694443
## 147 0.1701624813 0.050030747
## 173 0.4140496817 0.380267640
## 176 0.1364903601 0.011009041
## 180 0.7690324257 0.951921815
## 194 0.1290892835 0.021196302
## 199 0.7431877197 0.824081728
## 204 0.4427962683 0.532290264
## 209 0.2524584394 0.281511990
## 226 0.8205145481 0.962842692
## 253 0.2697161783 0.288755647
## 268 0.2050496121 0.182046106
## 272 0.7853494422 0.870432480
## 278 0.4037828147 0.476424339
## 286 0.1709963905 0.164468810
## 294 0.3864540118 0.370921416
## 295 0.3324459905 0.382318948
## 325 0.5648222226 0.618285144
## 338 0.3169501573 0.333509587
## 341 0.4091320913 0.496387038
## 344 0.3597852497 0.345139100
## 393 0.6568108753 0.815567016
## 400 0.0755990995 0.007417523
## 406 0.1079870730 0.022227321
## 413 0.2401496081 0.151690785
## 427 0.4664852461 0.464965629
## 443 0.3626018071 0.369346223
## 450 0.0619409799 0.011438249
## 466 0.6399842701 0.695480783
## 479 0.1730011790 0.136427131
## 503 0.5195604505 0.627322678
## 525 0.6494539515 0.833293378
## 526 0.6903516576 0.790328991
## 535 0.4243346907 0.470753220
## 590 0.7148487861 0.902375512
## 595 0.8058112133 0.937903824
## 600 0.4587231132 0.446819442
## 15  0.4622928225 0.839631285
## 29  0.4469696281 0.721333573
## 37  0.0073341469 0.108096598
## 38  0.2076589728 0.519075874
## 59  0.2610879638 0.472588875
## 90  0.3052183695 0.548420829
## 92  0.0002388966 0.122946701
## 102 0.2171576982 0.505044580
## 104 0.3889450287 0.717138722
## 129 0.2335235255 0.439058027
## 132 0.6034740848 0.958318281
## 133 0.6315072989 0.970767964
## 158 0.0290858189 0.148069276
## 175 0.4274944656 0.725024226
## 178 0.5923042425 0.900228734
## 189 0.1333296183 0.390023998
## 196 0.0531294835 0.276241161
## 202 0.5171110556 0.899924811
## 210 0.2596899802 0.503687580
## 215 0.4513108502 0.743930877
## 229 0.0483467767 0.218475638
## 232 0.1590223818 0.402696270
## 237 0.0865806018 0.263718613
## 239 0.5545858634 0.935806216
## 249 0.4992728804 0.812805236
## 258 0.5397982858 0.932383237
## 276 0.3367135401 0.672058288
## 293 0.3152607968 0.625878707
## 309 0.3199476011 0.541676977
## 311 0.1078112544 0.374908455
## 378 0.1084886545 0.376079086
## 409 0.0842775232 0.235715229
## 419 0.4264662997 0.798480970
## 420 0.0600483362 0.298929408
## 430 0.5141573721 0.908452330
## 451 0.4309255683 0.821331850
## 477 0.5964720468 0.913432184
## 481 0.2329343846 0.409654615
## 482 0.5770482090 0.969947845
## 488 0.2485451805 0.533491509
## 520 0.5784583788 0.907620618
## 524 0.1270027745 0.348539336
## 530 0.2665205784 0.458110426
## 540 0.2131855546 0.530223881
## 558 0.2770604359 0.510976796
## 562 0.2056735931 0.433566746
## 580 0.5705413527 0.994652604
## 581 0.2458533479 0.494881822
svm_model$index
##  [1]   8  11  30  39  42  48  71  73  75  85 103 104 105 124 127 131 141
## [18] 145 149 153 167 188 196 199 204 208 215 216 240 251 254 257 290 297
## [35] 300 306 317 329 335 347 353 372 390 391 398 440 443 445   9  18  23
## [52]  24  40  63  65  74  76  92  95  96 114 126 129 137 143 148 154 157
## [69] 170 173 178 180 186 192 202 214 228 229 281 303 310 311 318 336 351
## [86] 355 356 362 385 389 395 402 415 419 431 432
svm_model$rho
## [1] -0.1641859
#compute training accuracy
pred_train <- predict(svm_model, trainset)
mean(pred_train == trainset$y)
## [1] 1
#compute test accuracy
pred_test <- predict(svm_model, testset)
mean(pred_test == testset$y)
## [1] 1
#build scatter plot of training dataset
scatter_plot <- ggplot(data = trainset, aes(x = x1, y = x2, color = y)) + 
    geom_point() + 
    scale_color_manual(values = c("red", "blue"))
 
#add plot layer marking out the support vectors 
layered_plot <- 
    scatter_plot + geom_point(data = trainset[svm_model$index, ], aes(x = x1, y = x2), color = "purple", size = 4, alpha = 0.5)

#display plot
layered_plot

#calculate slope and intercept of decision boundary from weight vector and svm model
w <- c(x1=6.55241, x2=-4.73278)  # calculated manually outside of this module
slope_1 <- -w[1]/w[2]
intercept_1 <- svm_model$rho/w[2]

#build scatter plot of training dataset
scatter_plot <- ggplot(data = trainset, aes(x = x1, y = x2, color = y)) + 
    geom_point() + scale_color_manual(values = c("red", "blue"))
#add decision boundary
plot_decision <- scatter_plot + geom_abline(slope = slope_1, intercept = intercept_1) 
#add margin boundaries
plot_margins <- plot_decision + 
 geom_abline(slope = slope_1, intercept = intercept_1 - 1/w[2], linetype = "dashed")+
 geom_abline(slope = slope_1, intercept = intercept_1 + 1/w[2], linetype = "dashed")
#display plot
plot_margins

#build svm model
svm_model<- 
    svm(y ~ ., data = trainset, type = "C-classification", 
        kernel = "linear", scale = FALSE)

#plot decision boundaries and support vectors
plot(x = svm_model, data = trainset)

#build svm model, cost = 1
svm_model_1 <- svm(y ~ .,
                   data = trainset,
                   type = "C-classification",
                   cost = 1,
                   kernel = "linear",
                   scale = FALSE)

#print model details
svm_model_1
## 
## Call:
## svm(formula = y ~ ., data = trainset, type = "C-classification", 
##     cost = 1, kernel = "linear", scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
##       gamma:  0.5 
## 
## Number of Support Vectors:  96
#build svm model, cost = 100
svm_model_100 <- svm(y ~ .,
                   data = trainset,
                   type = "C-classification",
                   cost = 100,
                   kernel = "linear",
                   scale = FALSE)

#print model details
svm_model_100
## 
## Call:
## svm(formula = y ~ ., data = trainset, type = "C-classification", 
##     cost = 100, kernel = "linear", scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  100 
##       gamma:  0.5 
## 
## Number of Support Vectors:  6
# Create the base train_plot
train_plot <- ggplot(data = trainset, aes(x = x1, y = x2, color = y)) + 
    geom_point() + scale_color_manual(values = c("red", "blue"))
w_1 <- c(x1=6.55241, x2=-4.73278)  # calculated manually outside of this module
w_100 <- c(x1=18.3097, x2=-13.09972)  # calculated manually outside of this module
intercept_1 <- -0.005515526  # calculated outside of this module
intercept_100 <- 0.001852543  # calculated outside of this module
slope_1 <- -w_1[1]/w_1[2]
slope_100 <- -w_100[1]/w_100[2]


#add decision boundary and margins for cost = 1 to training data scatter plot
train_plot_with_margins <- train_plot + 
    geom_abline(slope = slope_1, intercept = intercept_1) +
    geom_abline(slope = slope_1, intercept = intercept_1 - 1/w_1[2], linetype = "dashed")+
    geom_abline(slope = slope_1, intercept = intercept_1 + 1/w_1[2], linetype = "dashed")

#display plot
train_plot_with_margins

#add decision boundary and margins for cost = 100 to training data scatter plot
train_plot_with_margins <- train_plot_with_margins + 
    geom_abline(slope = slope_100, intercept = intercept_100, color = "goldenrod") +
    geom_abline(slope = slope_100, intercept = intercept_100 - 1/w_100[2], linetype = "dashed", color = "goldenrod")+
    geom_abline(slope = slope_100, intercept = intercept_100 + 1/w_100[2], linetype = "dashed", color = "goldenrod")

#display plot 
train_plot_with_margins

svm_model<- svm(y ~ ., data = trainset, type = "C-classification", kernel = "linear", scale = FALSE)

#compute training accuracy
pred_train <- predict(svm_model, trainset)
mean(pred_train == trainset$y)
## [1] 1
#compute test accuracy
pred_test <- predict(svm_model, testset)
mean(pred_test == testset$y)
## [1] 1
#plot
plot(svm_model, trainset)

data(iris)
nTrials <- 100
accuracy <- numeric(nTrials)

#calculate accuracy for n distinct 80/20 train/test partitions
for (i in 1:nTrials){ 
    iris[, "train"] <- ifelse(runif(nrow(iris))<0.8, 1, 0)
    trainColNum <- grep("train", names(iris))
    trainset <- iris[iris$train == 1, -trainColNum]
    testset <- iris[iris$train == 0, -trainColNum]
    svm_model <- svm(Species~ ., data = trainset, 
                     type = "C-classification", kernel = "linear")
    pred_test <- predict(svm_model, testset)
    accuracy[i] <- mean(pred_test == testset$Species)
}

#mean accuracy and standard deviation
mean(accuracy) 
## [1] 0.9643194
sd(accuracy)
## [1] 0.03704363

Chapter 3 - Polynomial Kernels

Generating radially separable datasets:

  • The goal is to generate 2D points (again uniformly distributed on x1 and x2 using runif)
  • Can then define a class label based on whether each point lies within a given radius of the center
    • radius <- 0.7
    • radius_squared <- radius^2
    • df$y <- factor(ifelse(df$x1^2 + df$x2^2 < radius_squared, -1, 1), levels = c(-1,1))
    • p <- ggplot(data = df, aes(x = x1, y = x2, color = y)) + geom_point() + scale_color_manual(values = c("-1" = "red", "1" = "blue"))
    • p
  • Can add a circular boundary
    • circle <- function(x1_center, x2_center, r, npoint = 100){
    • #angular spacing of 2*pi/npoint between points
    • theta <- seq(0,2*pi,length.out = npoint)
    • x1_circ <- x1_center + r * cos(theta)
    • x2_circ <- x2_center + r * sin(theta)
    • return(data.frame(x1c = x1_circ, x2c = x2_circ))
    • }
    • boundary <- circle(x1_center = 0, x2_center = 0, r = radius)
    • p <- p + geom_path(data = boundary, aes(x = x1c, y = x2c), inherit.aes = FALSE)

Linear SVM on radially separable datasets:

  • The linear SVM will perform poorly on the radially separable dataset
    • svm_model <- svm(y ~ ., data = trainset, type = "C-classification", kernel = "linear")
    • svm_model
    • pred_test <- predict(svm_model,testset)
    • plot(svm_model,trainset) # all points are classified as 1

Kernel trick - devise a mathematical transformation that makes the data linearly separable:

  • For a circle, one could map X1 = x1^2 and X2 = x2^2; the boundary then becomes X1 + X2 = 0.49, which is linearly separable
  • The polynomial kernel has a degree (e.g., 1 for linear, 2 for quadratic, etc.) and tuning parameters gamma and coef0
    • The kernel also uses u dot v where u and v are vectors belonging to the dataset
    • (gamma * (u dot v) + coef0) ** degree
  • Applying the quadratic kernel to the circular data from above
    • svm_model <- svm(y ~ ., data = trainset, type = "C-classification", kernel = "polynomial", degree = 2)
    • plot(svm_model, trainset)
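The polynomial kernel formula above can be sketched directly in base R. `poly_kernel` below is a hypothetical helper for illustration, not a function from the e1071 package:

```r
# Hypothetical sketch of the polynomial kernel: (gamma * (u dot v) + coef0)^degree
poly_kernel <- function(u, v, gamma = 1, coef0 = 0, degree = 2) {
    (gamma * sum(u * v) + coef0)^degree
}

# quadratic kernel with gamma = 1, coef0 = 1: u dot v = 3 + 8 = 11, so (11 + 1)^2
poly_kernel(c(1, 2), c(3, 4), gamma = 1, coef0 = 1, degree = 2)  # 144
```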

Tuning SVM:

  • Set a search range for each parameter, typically as a sequence of values (e.g., in multiples of 10)
  • For each combination of parameters, build an SVM and assess the out-of-sample accuracy - can become computationally intensive, though
    • tune_out <- tune.svm(x = trainset[,-3], y = trainset[,3], type = "C-classification", kernel = "polynomial", degree = 2, cost = 10^(-1:2), gamma = c(0.1,1,10), coef0 = c(0.1,1,10))
    • tune_out$best.parameters$cost
    • tune_out$best.parameters$gamma
    • tune_out$best.parameters$coef0
    • svm_model <- svm(y ~ ., data = trainset, type = "C-classification", kernel = "polynomial", degree = 2, cost = tune_out$best.parameters$cost, gamma = tune_out$best.parameters$gamma, coef0 = tune_out$best.parameters$coef0)

Example code includes:

#set number of variables and seed
n <- 400
set.seed(1)

#Generate data frame with two uniformly distributed predictors, x1 and x2
df <- data.frame(x1 = runif(n, min = -1, max = 1), x2 = runif(n, min = -1, max = 1))

#We want a circular boundary. Set boundary radius 
radius <- 0.8
radius_squared <- radius^2

#create dependent categorical variable, y, with value -1 or 1 depending on whether point lies
#within or outside the circle.
df$y <- factor(ifelse(df$x1**2 + df$x2**2 < radius_squared, -1, 1), levels = c(-1, 1))


#build scatter plot, distinguish class by color
scatter_plot <- ggplot(data = df, aes(x = x1, y = x2, color = y)) + 
    geom_point() +
    scale_color_manual(values = c("red", "blue"))

#display plot
scatter_plot

inTrain <- sample(1:nrow(df), round(0.75*nrow(df)), replace=FALSE)
trainset <- df[sort(inTrain), ]
testset <- df[-inTrain, ]


#default cost model
svm_model_1 <- svm(y ~ ., data = trainset, type = "C-classification", cost = 1, kernel = "linear")

#training accuracy
pred_train <- predict(svm_model_1, trainset)
mean(pred_train == trainset$y)
## [1] 0.64
#test accuracy
pred_test <- predict(svm_model_1, testset)
mean(pred_test == testset$y)
## [1] 0.48
#cost = 100 model
svm_model_100 <- svm(y ~ ., data = trainset, type = "C-classification", cost = 100, kernel = "linear")

#accuracy
pred_train <- predict(svm_model_100, trainset)
mean(pred_train == trainset$y)
## [1] 0.64
pred_test <- predict(svm_model_100, testset)
mean(pred_test == testset$y)
## [1] 0.48
#print average accuracy and standard deviation
accuracy <- rep(NA, 100)
set.seed(2)

#calculate accuracies for 100 training/test partitions
for (i in 1:100){
    df[, "train"] <- ifelse(runif(nrow(df))<0.8, 1, 0)
    trainset <- df[df$train == 1, ]
    testset <- df[df$train == 0, ]
    trainColNum <- grep("train", names(trainset))
    trainset <- trainset[, -trainColNum]
    testset <- testset[, -trainColNum]
    svm_model<- svm(y ~ ., data = trainset, type = "C-classification", kernel = "linear")
    pred_test <- predict(svm_model, testset)
    accuracy[i] <- mean(pred_test == testset$y)
}

#print average accuracy and standard deviation
mean(accuracy)
## [1] 0.5554571
sd(accuracy)
## [1] 0.04243524
#transform data
df1 <- data.frame(x1sq = df$x1^2, x2sq = df$x2^2, y = df$y)

#plot data points in the transformed space
plot_transformed <- ggplot(data = df1, aes(x = x1sq, y = x2sq, color = y)) + 
    geom_point()+ guides(color = FALSE) + 
    scale_color_manual(values = c("red", "blue"))

#add decision boundary and visualize
plot_decision <- plot_transformed + geom_abline(slope = -1, intercept = 0.64)
plot_decision

# Still want to use the old (non-squared) data
inTrain <- sample(1:nrow(df), round(0.75*nrow(df)), replace=FALSE)
df$train <- NULL
trainset <- df[sort(inTrain), ]
testset <- df[-inTrain, ]

svm_model <- svm(y ~ ., data = trainset, type = "C-classification", kernel = "polynomial", degree = 2)

#measure training and test accuracy
pred_train <- predict(svm_model, trainset)
mean(pred_train == trainset$y)
## [1] 0.9866667
pred_test <- predict(svm_model, testset)
mean(pred_test == testset$y)
## [1] 0.98
#plot
plot(svm_model, trainset)

#tune model
tune_out <- 
    tune.svm(x = trainset[, -3], y = trainset[, 3], 
             type = "C-classification", 
             kernel = "polynomial", degree = 2, cost = 10^(-1:2), 
             gamma = c(0.1, 1, 10), coef0 = c(0.1, 1, 10))

#list optimal values
tune_out$best.parameters$cost
## [1] 0.1
tune_out$best.parameters$gamma
## [1] 10
tune_out$best.parameters$coef0
## [1] 0.1
#Build tuned model
svm_model <- svm(y ~ ., data = trainset, type = "C-classification", 
                 kernel = "polynomial", degree = 2, 
                 cost = tune_out$best.parameters$cost, 
                 gamma = tune_out$best.parameters$gamma, 
                 coef0 = tune_out$best.parameters$coef0)

#Calculate training and test accuracies
pred_train <- predict(svm_model, trainset)
mean(pred_train == trainset$y)
## [1] 0.9966667
pred_test <- predict(svm_model, testset)
mean(pred_test == testset$y)
## [1] 1
#plot model
plot(svm_model, trainset)


Chapter 4 - Radial Basis Kernel Functions

Generating complex datasets:

  • The RBF kernel is highly flexible, can fit complex boundaries, and is widely used in real-world applications
  • Can generate complex data by using different distributions for x1 and x2
    • n <- 600
    • set.seed(42)
    • df <- data.frame(x1 = rnorm(n, mean = -0.5, sd = 1), x2 = runif(n, min = -1, max = 1))
  • The decision boundary can then be two circles that just barely touch at the origin
    • radius <- 0.7
    • radius_squared <- radius^2
    • center_1 <- c(-0.7,0)
    • center_2 <- c(0.7,0)
    • df$y <- factor(ifelse((df$x1-center_1[1])^2 + (df$x2-center_1[2])^2 < radius_squared | (df$x1-center_2[1])^2 + (df$x2-center_2[2])^2 < radius_squared, -1, 1), levels = c(-1,1))
    • p <- ggplot(data = df, aes(x = x1, y = x2, color = y)) + geom_point() + guides(color = FALSE) + scale_color_manual(values = c("red","blue"))
    • p
  • Can then build linear, polynomial, and RBF kernels to model the data

Motivating the RBF kernel:

  • Neither the linear kernel nor the polynomial kernel will work well for the dataset as described
  • Can use the heuristic that points near each other probably belong to the same class (similar to kNN)
    • The kernel should have a maximum at (a, b), and should decay as you move away from (a, b)
    • The rate of decay, all else equal, should be the same in all directions, with a tunable parameter gamma
    • As good fortune has it, the exponential exp(-gamma * r) has all of these properties
    • rbf <- function(r, gamma) exp(-gamma*r)
    • ggplot(data.frame(r = c(0, 10)), aes(r)) +
    • stat_function(fun = rbf, args = list(gamma = 0.2), aes(color = "0.2")) +
    • stat_function(fun = rbf, args = list(gamma = 0.4), aes(color = "0.4")) +
    • stat_function(fun = rbf, args = list(gamma = 0.6), aes(color = "0.6")) +
    • stat_function(fun = rbf, args = list(gamma = 0.8), aes(color = "0.8")) +
    • stat_function(fun = rbf, args = list(gamma = 1), aes(color = "1")) +
    • stat_function(fun = rbf, args = list(gamma = 2), aes(color = "2")) +
    • scale_color_manual("gamma", values = c("red","orange","yellow","green","blue","violet")) +
    • ggtitle("Radial basis function (gamma = 0.2 to 2)")

The RBF kernel simulates some of the principles of kNN using exponential decay:

  • The RBF kernel can be built using pre-set R commands
    • svm_model <- svm(y ~ ., data = trainset, type = "C-classification", kernel = "radial")
  • The predicted decision boundary will no longer be linear, and can be refined through tuning
    • tune_out <- tune.svm(x = trainset[,-3], y = trainset[,3], gamma = 5*10^(-2:2), cost = c(0.01,0.1,1,10,100), type = "C-classification", kernel = "radial")
    • tune_out$best.parameters$cost
    • tune_out$best.parameters$gamma
    • svm_model <- svm(y ~ ., data = trainset, type = "C-classification", kernel = "radial", cost = tune_out$best.parameters$cost, gamma = tune_out$best.parameters$gamma)
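Applied to two data points rather than a single distance, the same decay idea gives the RBF kernel value exp(-gamma * ||u - v||^2). The helper `rbf_kernel` below is a hypothetical illustration of this form, not an e1071 function:

```r
# Hypothetical sketch: RBF kernel value for two points, exp(-gamma * ||u - v||^2)
rbf_kernel <- function(u, v, gamma = 1) {
    exp(-gamma * sum((u - v)^2))
}

rbf_kernel(c(0, 0), c(0, 0))             # identical points: kernel = 1 (the maximum)
rbf_kernel(c(0, 0), c(1, 0), gamma = 1)  # decays with squared distance: exp(-1)
```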

Example code includes:

#number of data points
n <- 1000

#set seed
set.seed(1)

#create dataframe
df <- data.frame(x1 = rnorm(n, mean = -0.5, sd = 1), x2 = runif(n, min = -1, max = 1))


#set radius and centers
radius <- 0.8
center_1 <- c(-0.8, 0)
center_2 <- c(0.8, 0)
radius_squared <- radius^2

#create binary classification variable
df$y <- factor(ifelse((df$x1-center_1[1])^2 + (df$x2-center_1[2])^2 < radius_squared |
                      (df$x1-center_2[1])^2 + (df$x2-center_2[2])^2 < radius_squared, -1, 1),
                      levels = c(-1, 1))


#create scatter plot
scatter_plot<- ggplot(data = df, aes(x = x1, y = x2, color = y)) + 
    geom_point() + 
    scale_color_manual(values = c("red", "blue"))
 
scatter_plot 

# Create 75/25 split for train/test
inTrain <- sample(1:nrow(df), round(0.75*nrow(df)), replace=FALSE)
trainset <- df[sort(inTrain), ]
testset <- df[-inTrain, ]


#build model
svm_model <- svm(y ~ ., data = trainset, type = "C-classification", kernel = "linear")

#accuracy
pred_train <- predict(svm_model, trainset)
mean(pred_train == trainset$y)
## [1] 0.5853333
pred_test <- predict(svm_model, testset)
mean(pred_test == testset$y)
## [1] 0.564
#plot model against testset
plot(svm_model, testset)

#build model
svm_model <- svm(y ~ ., data = trainset, type = "C-classification", kernel = "polynomial", degree = 2)

#accuracy
pred_train <- predict(svm_model, trainset)
mean(pred_train == trainset$y)
## [1] 0.8253333
pred_test <- predict(svm_model, testset)
mean(pred_test == testset$y)
## [1] 0.788
#plot model
plot(svm_model, trainset)

#create vector to store accuracies and set random number seed
accuracy <- rep(NA, 100)
set.seed(2)


# Create a dummy frame dfDum for use in the for loop
dfDum <- df

#calculate accuracies for 100 training/test partitions
for (i in 1:100){
    dfDum[, "train"] <- ifelse(runif(nrow(dfDum))<0.8, 1, 0)
    trainset <- dfDum[dfDum$train == 1, ]
    testset <- dfDum[dfDum$train == 0, ]
    trainColNum <- grep("train", names(trainset))
    trainset <- trainset[, -trainColNum]
    testset <- testset[, -trainColNum]
    svm_model<- svm(y ~ ., data = trainset, type = "C-classification", kernel = "polynomial", degree = 2)
    pred_test <- predict(svm_model, testset)
    accuracy[i] <- mean(pred_test == testset$y)
}

#print average accuracy and standard deviation
mean(accuracy)
## [1] 0.804765
sd(accuracy)
## [1] 0.02398396
#create vector to store accuracies and set random number seed
accuracy <- rep(NA, 100)
set.seed(2)

#calculate accuracies for 100 training/test partitions
for (i in 1:100){
    dfDum[, "train"] <- ifelse(runif(nrow(dfDum))<0.8, 1, 0)
    trainset <- dfDum[dfDum$train == 1, ]
    testset <- dfDum[dfDum$train == 0, ]
    trainColNum <- grep("train", names(trainset))
    trainset <- trainset[, -trainColNum]
    testset <- testset[, -trainColNum]
    svm_model<- svm(y ~ ., data = trainset, type = "C-classification", kernel = "radial")
    pred_test <- predict(svm_model, testset)
    accuracy[i] <- mean(pred_test == testset$y)
}

#print average accuracy and standard deviation
mean(accuracy)
## [1] 0.9034203
sd(accuracy)
## [1] 0.01786378
# Re-create original 75/25 split for train/test
inTrain <- sample(1:nrow(df), round(0.75*nrow(df)), replace=FALSE)
trainset <- df[sort(inTrain), ]
testset <- df[-inTrain, ]

#tune model
tune_out <- tune.svm(x = trainset[, -3], y = trainset[, 3], 
                     gamma = 5*10^(-2:2), 
                     cost = c(0.01, 0.1, 1, 10, 100), 
                     type = "C-classification", kernel = "radial")
tune_out
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  gamma cost
##      5    1
## 
## - best performance: 0.04
#build tuned model
svm_model <- svm(y~ ., data = trainset, type = "C-classification", kernel = "radial", 
                 cost = tune_out$best.parameters$cost, 
                 gamma = tune_out$best.parameters$gamma)

#calculate test accuracy
pred_test <- predict(svm_model, testset)
mean(pred_test == testset$y)
## [1] 0.956
#Plot decision boundary against test data
plot(svm_model, testset)


Experimental Design in R

Chapter 1 - Introduction to Experimental Design

Introduction to experimental design:

  • Experiments start with a question in mind, then proceed to collecting and analyzing data
  • This course will use open data, meaning that we do not know the original experimental design
  • Key conditions of an experiment include randomization, replication, and blocking
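Randomization, the first condition above, can be sketched with base R's sample(); the group labels and sizes here are made up for illustration:

```r
# Sketch: randomly assign 12 subjects to two equally sized groups (randomization)
set.seed(123)
subjects  <- 1:12
treatment <- sample(rep(c("control", "treatment"), each = 6))

# each group ends up with 6 subjects, assigned in random order
table(treatment)
```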

Hypothesis testing:

  • The null hypothesis changes depending on the question of interest - e.g., "no effect" (two-sided) or "no positive effect" (one-sided)
  • Power is the probability that the test correctly rejects the null hypothesis when the alternative hypothesis is true (target >= 80%)
  • The effect size is the standardized measure of the difference that you are trying to detect
  • Sample size is generally chosen so that the effect size can be measured at the required power
  • Example of using the pwr package for calculating the metrics
    • library(pwr)
    • pwr.anova.test(k = 3, n = 20, f = 0.2, sig.level = 0.05, power = NULL) # one must be entered as NULL (this will be calculated) ; k groups with n per group and f effect size
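For a two-sample comparison, the standardized effect size is often Cohen's d: the difference in means divided by the pooled standard deviation. `cohens_d` below is a hypothetical helper, not part of the pwr package:

```r
# Hypothetical sketch: Cohen's d, the standardized difference between two sample means
cohens_d <- function(x, y) {
    nx <- length(x); ny <- length(y)
    s_pooled <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
    (mean(x) - mean(y)) / s_pooled
}

cohens_d(c(1, 2, 3), c(2, 3, 4))  # both sample sds are 1, so d = (2 - 3) / 1 = -1
```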

Example code includes:

# load the ToothGrowth dataset
data("ToothGrowth")

#perform a two-sided t-test
t.test(x = ToothGrowth$len, alternative = "two.sided", mu = 18)
## 
##  One Sample t-test
## 
## data:  ToothGrowth$len
## t = 0.82361, df = 59, p-value = 0.4135
## alternative hypothesis: true mean is not equal to 18
## 95 percent confidence interval:
##  16.83731 20.78936
## sample estimates:
## mean of x 
##  18.81333
#perform a t-test
ToothGrowth_ttest <- t.test(len ~ supp, data = ToothGrowth)

#tidy the t-test model object
broom::tidy(ToothGrowth_ttest)
##   estimate estimate1 estimate2 statistic    p.value parameter   conf.low
## 1      3.7  20.66333  16.96333  1.915268 0.06063451  55.30943 -0.1710156
##   conf.high                  method alternative
## 1  7.571016 Welch Two Sample t-test   two.sided
#group by supp, dose, then examine how many observations in ToothGrowth there are by those groups
ToothGrowth %>% 
    group_by(supp, dose) %>% 
    summarize(n=n())
## # A tibble: 6 x 3
## # Groups:   supp [?]
##   supp   dose     n
##   <fct> <dbl> <int>
## 1 OJ    0.500    10
## 2 OJ    1.00     10
## 3 OJ    2.00     10
## 4 VC    0.500    10
## 5 VC    1.00     10
## 6 VC    2.00     10
#create a boxplot with geom_boxplot()
ggplot(ToothGrowth, aes(x=as.factor(dose), y=len)) + 
    geom_boxplot()

#create the ToothGrowth_aov model object
ToothGrowth_aov <- aov(len ~ dose + supp, data = ToothGrowth)

#examine the model object with summary()
summary(ToothGrowth_aov)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## dose         1 2224.3  2224.3  123.99 6.31e-16 ***
## supp         1  205.3   205.3   11.45   0.0013 ** 
## Residuals   57 1022.6    17.9                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#less than
t.test(x = ToothGrowth$len, alternative = "less", mu = 18)
## 
##  One Sample t-test
## 
## data:  ToothGrowth$len
## t = 0.82361, df = 59, p-value = 0.7933
## alternative hypothesis: true mean is less than 18
## 95 percent confidence interval:
##      -Inf 20.46358
## sample estimates:
## mean of x 
##  18.81333
#greater than
t.test(x = ToothGrowth$len, alternative = "greater", mu = 18)
## 
##  One Sample t-test
## 
## data:  ToothGrowth$len
## t = 0.82361, df = 59, p-value = 0.2067
## alternative hypothesis: true mean is greater than 18
## 95 percent confidence interval:
##  17.16309      Inf
## sample estimates:
## mean of x 
##  18.81333
#calculate power
pwr::pwr.t.test(n = 100, d = 0.35, sig.level = 0.10, type = "two.sample", 
                alternative = "two.sided", power = NULL
                )
## 
##      Two-sample t test power calculation 
## 
##               n = 100
##               d = 0.35
##       sig.level = 0.1
##           power = 0.7943532
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
#calculate sample size
pwr::pwr.t.test(n = NULL, d = 0.25, sig.level = 0.05, 
                type = "one.sample", alternative = "greater", power = 0.8
                )
## 
##      One-sample t test power calculation 
## 
##               n = 100.2877
##               d = 0.25
##       sig.level = 0.05
##           power = 0.8
##     alternative = greater

Chapter 2 - Basic Experiments

Single and Multiple Factor Experiments:

  • The ANOVA (Analysis of Variance) test allows for comparing means across three or more groups: is at least one mean different?
    • model_1 <- lm(y ~ x, data = dataset) # first option: fit with lm, then call anova
    • anova(model_1)
    • aov(y ~ x, data = dataset) # second option is a straight call to aov
  • The multiple factor experiment includes additional potential explanatory variables
    • model2 <- lm(y ~ x + r + s + t, data = dataset)
    • anova(model2)
  • The Lending Club data is 890k x 75, and contains data from a lending company

Model Validation:

  • Exploratory data analysis (EDA) is an important step prior to modeling the data
  • Boxplots can be a helpful way to explore the data
    • ggplot(data = lendingclub, aes(x = verification_status, y = funded_amnt)) + geom_boxplot()
  • ANOVA and other linear models generally assume that the residuals are normally distributed
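The residual-normality assumption can be checked visually with a Q-Q plot and formally with a Shapiro-Wilk test. A minimal sketch, using the built-in ToothGrowth data rather than the Lending Club data:

```r
# Sketch: checking residual normality for a simple linear model
fit <- lm(len ~ supp, data = ToothGrowth)

# visual check: residuals should fall near the reference line
qqnorm(residuals(fit))
qqline(residuals(fit))

# formal check: a small p-value suggests the residuals are not normally distributed
shapiro.test(residuals(fit))
```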

A/B Testing:

  • A/B tests are a type of controlled experiment with only two variants of something
  • Power analysis is crucial to A/B testing: it determines the sample size required to detect the expected effect size at the desired power

Example code includes:

lendingclub <- readr::read_csv("./RInputFiles/lendclub.csv")
## Parsed with column specification:
## cols(
##   member_id = col_integer(),
##   loan_amnt = col_integer(),
##   funded_amnt = col_integer(),
##   term = col_character(),
##   int_rate = col_double(),
##   emp_length = col_character(),
##   home_ownership = col_character(),
##   annual_inc = col_double(),
##   verification_status = col_character(),
##   loan_status = col_character(),
##   purpose = col_character(),
##   grade = col_character()
## )
#examine the variables with glimpse()
glimpse(lendingclub)
## Observations: 1,500
## Variables: 12
## $ member_id           <int> 55096114, 1555332, 1009151, 69524202, 7212...
## $ loan_amnt           <int> 11000, 10000, 13000, 5000, 18000, 14000, 8...
## $ funded_amnt         <int> 11000, 10000, 13000, 5000, 18000, 14000, 8...
## $ term                <chr> "36 months", "36 months", "60 months", "36...
## $ int_rate            <dbl> 12.69, 6.62, 10.99, 12.05, 5.32, 16.99, 13...
## $ emp_length          <chr> "10+ years", "10+ years", "3 years", "10+ ...
## $ home_ownership      <chr> "RENT", "MORTGAGE", "MORTGAGE", "MORTGAGE"...
## $ annual_inc          <dbl> 51000, 40000, 78204, 51000, 96000, 47000, ...
## $ verification_status <chr> "Not Verified", "Verified", "Not Verified"...
## $ loan_status         <chr> "Current", "Fully Paid", "Fully Paid", "Cu...
## $ purpose             <chr> "debt_consolidation", "debt_consolidation"...
## $ grade               <chr> "C", "A", "B", "C", "A", "D", "C", "A", "D...
#find median loan_amt, mean int_rate, and mean annual_inc with summarise()
lendingclub %>% summarise(median(loan_amnt), mean(int_rate), mean(annual_inc))
## # A tibble: 1 x 3
##   `median(loan_amnt)` `mean(int_rate)` `mean(annual_inc)`
##                 <dbl>            <dbl>              <dbl>
## 1               13000             13.3              75736
# use ggplot2 to build a bar chart of purpose
ggplot(data=lendingclub, aes(x = purpose)) + geom_bar()

#use recode() to create the new purpose_recode variable.
lendingclub$purpose_recode <- lendingclub$purpose %>% recode( 
        "credit_card" = "debt_related",
        "debt_consolidation" = "debt_related", 
        "medical" = "debt_related",
        "car" = "big_purchase", 
        "major_purchase" = "big_purchase", 
        "vacation" = "big_purchase",
        "moving" = "life_change", 
        "small_business" = "life_change", 
        "wedding" = "life_change",
        "house" = "home_related", 
        "home_improvement" = "home_related"
        )


#build a linear regression model, stored as purpose_recode_model
purpose_recode_model <- lm(funded_amnt ~ purpose_recode, data = lendingclub)

#look at results of purpose_recode_model
summary(purpose_recode_model)
## 
## Call:
## lm(formula = funded_amnt ~ purpose_recode, data = lendingclub)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -14472  -6251  -1322   4678  25761 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      9888.1     1248.9   7.917 4.69e-15 ***
## purpose_recodedebt_related       5433.5     1270.5   4.277 2.02e-05 ***
## purpose_recodehome_related       4845.0     1501.0   3.228  0.00127 ** 
## purpose_recodelife_change        4095.3     2197.2   1.864  0.06254 .  
## purpose_recodeother              -649.3     1598.3  -0.406  0.68461    
## purpose_recoderenewable_energy  -1796.4     4943.3  -0.363  0.71636    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8284 on 1494 degrees of freedom
## Multiple R-squared:  0.03473,    Adjusted R-squared:  0.0315 
## F-statistic: 10.75 on 5 and 1494 DF,  p-value: 3.598e-10
#get anova results and save as purpose_recode_anova
purpose_recode_anova <- anova(purpose_recode_model)

# look at the class of purpose_recode_anova
class(purpose_recode_anova)
## [1] "anova"      "data.frame"
#Use aov() to build purpose_recode_aov
purpose_recode_aov <- aov(funded_amnt ~ purpose_recode, data = lendingclub)

#Conduct Tukey's HSD test to create tukey_output
tukey_output <- TukeyHSD(purpose_recode_aov)

#tidy tukey_output to make sense of the results
broom::tidy(tukey_output)
##              term                    comparison   estimate    conf.low
## 1  purpose_recode     debt_related-big_purchase  5433.5497   1808.4015
## 2  purpose_recode     home_related-big_purchase  4845.0126    562.0345
## 3  purpose_recode      life_change-big_purchase  4095.2652  -2174.3728
## 4  purpose_recode            other-big_purchase  -649.3001  -5209.7754
## 5  purpose_recode renewable_energy-big_purchase -1796.4015 -15901.7471
## 6  purpose_recode     home_related-debt_related  -588.5371  -3055.5905
## 7  purpose_recode      life_change-debt_related -1338.2845  -6539.3240
## 8  purpose_recode            other-debt_related -6082.8498  -9005.2437
## 9  purpose_recode renewable_energy-debt_related -7229.9512 -20893.8901
## 10 purpose_recode      life_change-home_related  -749.7475  -6428.9211
## 11 purpose_recode            other-home_related -5494.3127  -9201.4124
## 12 purpose_recode renewable_energy-home_related -6641.4141 -20494.4076
## 13 purpose_recode             other-life_change -4744.5652 -10635.8339
## 14 purpose_recode  renewable_energy-life_change -5891.6667 -20481.7279
## 15 purpose_recode        renewable_energy-other -1147.1014 -15088.3877
##    conf.high  adj.p.value
## 1   9058.698 2.905254e-04
## 2   9127.991 1.606975e-02
## 3  10364.903 4.250779e-01
## 4   3911.175 9.985823e-01
## 5  12308.944 9.991732e-01
## 6   1878.516 9.840871e-01
## 7   3862.755 9.776960e-01
## 8  -3160.456 5.321224e-08
## 9   6433.988 6.580011e-01
## 10  4929.426 9.990158e-01
## 11 -1787.213 3.576380e-04
## 12  7211.579 7.462798e-01
## 13  1146.703 1.953886e-01
## 14  8698.395 8.592034e-01
## 15 12794.185 9.999029e-01
#Use aov() to build purpose_emp_aov
purpose_emp_aov <- aov(funded_amnt ~ purpose_recode + emp_length, data=lendingclub)

#print purpose_emp_aov to the console
purpose_emp_aov
## Call:
##    aov(formula = funded_amnt ~ purpose_recode + emp_length, data = lendingclub)
## 
## Terms:
##                 purpose_recode   emp_length    Residuals
## Sum of Squares      3688783338   2044273211 100488872355
## Deg. of Freedom              5           11         1483
## 
## Residual standard error: 8231.679
## Estimated effects may be unbalanced
#call summary() to see the p-values
summary(purpose_emp_aov)
##                  Df    Sum Sq   Mean Sq F value   Pr(>F)    
## purpose_recode    5 3.689e+09 737756668  10.888 2.63e-10 ***
## emp_length       11 2.044e+09 185843019   2.743  0.00161 ** 
## Residuals      1483 1.005e+11  67760534                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#examine the summary of int_rate
summary(lendingclub$int_rate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.32    9.99   12.99   13.31   16.29   26.77
#examine int_rate by grade
lendingclub %>% 
    group_by(grade) %>% 
    summarise(mean = mean(int_rate), var = var(int_rate), median = median(int_rate))
## # A tibble: 7 x 4
##   grade  mean   var median
##   <chr> <dbl> <dbl>  <dbl>
## 1 A      7.27 0.961   7.26
## 2 B     10.9  2.08   11.0 
## 3 C     14.0  1.42   14.0 
## 4 D     17.4  1.62   17.6 
## 5 E     20.1  2.71   20.0 
## 6 F     23.6  2.87   23.5 
## 7 G     26.1  0.198  25.9
#make a boxplot of int_rate by grade
ggplot(lendingclub, aes(x = grade, y = int_rate)) + geom_boxplot()

#use aov() to create grade_aov plus call summary() to print results
grade_aov <- aov(int_rate ~ grade, data = lendingclub)
summary(grade_aov)
##               Df Sum Sq Mean Sq F value Pr(>F)    
## grade          6  27013    4502    2637 <2e-16 ***
## Residuals   1493   2549       2                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#for a 2x2 grid of plots:
par(mfrow=c(2, 2))

#plot grade_aov
plot(grade_aov)

#back to defaults
par(mfrow=c(1, 1))

#Bartlett's test for homogeneity of variance
bartlett.test(int_rate ~ grade, data=lendingclub)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  int_rate by grade
## Bartlett's K-squared = 78.549, df = 6, p-value = 7.121e-15
#use the correct function from pwr to find the sample size
pwr::pwr.t.test(n=NULL, d=0.2, sig.level=0.05, 
                type="two.sample", alternative="two.sided", power=0.8
                )
## 
##      Two-sample t test power calculation 
## 
##               n = 393.4057
##               d = 0.2
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
lc_A <- c(11976148, 1203719, 54998739, 5801830, 31587242, 7711391, 54494666, 57663583, 8967787, 21760921, 44765721, 8596988, 5794746, 59501253, 10578432, 36058744, 11727607, 357888, 51936863, 1178593, 57315811, 5705168, 46024211, 12947039, 57345207, 55299831, 28763037, 49763149, 20077511, 60216198, 12295190, 1570287, 61408414, 59121340, 32349527, 5773180, 26899704, 55412161, 2217935, 16462713, 9196065, 27802028, 40949245, 56007625, 56935379, 62187473, 20178048, 604912, 58533358, 652594, 44066849, 38942161, 6414816, 65617953, 51816492, 43489983, 6794967, 42345315, 59532019, 13107597, 63249029, 7371829, 12335467, 8560739, 7337238, 887484, 23493355, 41031080, 60537197, 12816159, 38446687, 51026618, 6374688, 18685270, 296645, 44439325, 4915968, 63449566, 25256236, 63407874, 36753301, 20728660, 7937228, 13058684, 636359, 50527238, 40450502, 1018943, 12438198, 3065732, 1510626, 5764344, 37840363, 27460227, 39751366, 5028066, 43956700, 56109033, 1412622, 44289534, 41770436, 49956562, 44409121, 47168726, 60953428, 52189251, 64281487, 51928150, 1002880, 4537354, 12605849, 477843, 6808167, 38629237, 33311208, 36109419, 58593881, 40362979, 440300, 9848361, 30656060, 15691500, 4375269, 15360849, 7077904, 66076532, 33350264, 4175651, 44006939, 21130605, 54098234, 53192890, 7371114, 12967808, 58061230, 34803392, 5544911, 28843825, 63244663, 38504887, 68565204, 1211255, 63427670, 56472411, 10548622, 43957279, 59313014, 5768723, 66210490, 25507112, 55472659, 61339767, 65684813, 45544639, 43710238, 46833245, 13028661, 13167268, 3064642, 62072249, 27631726, 65825964, 15540990, 64320858, 8605358, 17795606, 9894584, 543619, 2380700, 20959552, 57743104, 63917130, 38480348, 61393540, 19916851)
lc_A <- c(lc_A, 12528162, 7264617, 61480809, 36411752, 20139228, 21290880, 390228, 45584424, 17755019, 23413261, 15490914, 1254285, 875004, 24274579, 51006600, 11458143, 5125832, 37802077, 57327243, 41059894, 64978360, 58683523, 4290736, 40919379, 65029207, 7096004, 42285591, 7388784, 65914238, 46833088, 21221678, 62855006, 10557733, 44915714, 23083224, 67289213, 9746670, 349608, 66610322, 1595886, 3635144, 38419356, 9715410, 9726377, 621152, 23213635, 18685424, 65782663, 57304429, 20770003, 8865120, 58664359, 1454540, 42404539, 60952405, 61339308, 7367648, 11215938, 41207320, 23553299, 1681376, 7617266, 30485630, 10604792, 46044414, 63094909, 59189668, 10106916, 52058386, 17763104, 6396213, 8981232, 48070364, 10615808, 11956507, 38444903, 60216940, 58310439, 10099562, 7504691, 17533228, 62236540, 38626163, 55657128, 7728107, 42415348, 42454693, 4777573, 23834164, 25157042, 1339435, 50587486, 55998961, 32950014, 28422748, 492346, 50607472, 11335041, 4254623, 65058537, 5375256, 5646680, 44430975, 4054992, 55253292, 68375791, 16822421, 64978226, 59859214, 65424555, 10112206, 6908772, 67879649, 4794842, 31227479, 17423361, 64049774, 58624386, 14829134, 50233873, 44389635, 29684724, 452267, 43044890, 55942742, 19516366, 34443897, 57135665, 34392172, 17352839, 12896521, 40451807, 43255228, 40372428, 8568706, 68364520, 3486848, 40991148, 19196658, 8658538, 65885614, 38352455, 65674149, 1029473, 39290483, 47420355, 65364529, 32318884, 13115811, 48484348, 65975356, 56129109, 3378980, 31026386, 55231010, 41113253, 1480114, 51406116, 2445051, 8627441, 60942818, 55453270, 58573102, 25767158, 9655554, 49783137, 42273770, 32038806, 681948, 65059359, 48546050, 20169281, 68546780, 7065575, 46387142, 66180493, 58430918, 1390497, 41950574, 39888056, 11774847, 55308824, 51969105, 7936525, 5960208, 7700566, 14529825, 14688918, 43024566, 21110140, 55797803, 31236439, 6817136, 1467168, 36028128, 60781310, 66595886, 57548184, 3194733, 8589175, 1546517, 17654773, 40572454, 63284984, 
5780985, 39660177, 64050493, 55081623, 51346675, 1235123, 65633931, 66390924, 17413278, 57950994, 55911330, 11814853, 31357211, 56038385, 40038565, 64400706, 35034758, 60296238, 6527713, 5685238, 1062701, 63406447, 64008930, 63476297, 5114652, 20060374, 10085133, 61328568, 9435001, 56057656, 49934674, 39661404, 19616499, 34342717, 46653815, 45614269, 59290211, 31296803, 50605437, 46928301, 58562582, 63879452, 65733359, 51086476, 40601201, 9845217, 29213549, 41227222, 7337659, 46517072, 38610653, 9694813, 21350102, 46716202, 50535150, 39729407, 22263578, 25987787, 64913590, 19636684, 59311687, 4295372, 571012, 20588847, 63424767, 1099384, 3810242, 5604591, 39760687, 43739869, 56019939, 51526987, 45494853, 4302122, 21009984, 66210827, 67255219, 46613149, 63345017, 43570211, 62002161, 2214708, 4234697, 51055338, 19647002, 28593783, 6804647, 40542044, 42263319, 4784593, 19636686, 44015285, 55697847, 5814660, 15409525, 2307393, 54404433, 15490230, 62245810, 64969544, 48120716, 41040511, 51176224, 6376426, 60386775, 826517, 27601385, 8185587, 28564285, 68613325, 58623041, 60941473, 1635691, 7729270, 46417835, 57285778, 55960993, 66510262, 60285691, 61902329, 68565071)


lc_B <- c(62012715, 49974687, 27570947, 63417796, 61449107, 12906517, 57074291, 21021086, 404854, 15139172, 46774978, 50486061, 4305577, 65783354, 48544529, 31667129, 36980133, 19117791, 3845908, 846821, 40381968, 64018601, 57184860, 49963980, 44142706, 6327771, 20811335, 67336862, 3628833, 31247310, 4764984, 1619549, 56492219, 67959628, 61672211, 1472227, 55268407, 13497237, 57538143, 43096178, 35723158, 226780, 2307012, 1210773, 50273799, 28903599, 50839792, 44916418, 9714937, 51876659, 3919804, 12968154, 54978278, 6938022, 53854432, 63350177, 39692948, 67216234, 22253060, 59099446, 46135199, 11717805, 48596572, 8475061, 61462130, 21480483, 2014943, 41430440, 43196143, 243173, 61543762, 66562164, 67878273, 41100627, 11915326, 28753020, 12617369, 59090559, 55583726, 31256585, 544537, 61430245, 1681767, 7670078, 38506546, 36500594, 31367711, 46694948, 2080069, 38457330, 54524836, 27651989, 63358477, 62002922, 8995111, 45694307, 61470409, 17933815, 27370082, 66612753, 1536521, 54948920, 57548472, 876991, 40127147, 57365210, 1904740, 3195692, 743529, 67408356, 8766184, 23643466, 51336378, 13397002, 3700020, 49935259, 38455198, 63506356, 11386690, 32479126, 6300017, 67427011, 63344398, 51366616, 727247, 59291548, 21551336, 8776003, 16111335, 1051513, 61973285, 60764833, 59190150, 25406927, 10138072, 61361677, 32279884, 63337618, 49933340, 30565592, 3217416, 61883095, 63436296, 58290318, 29884855, 50353289, 14699170, 67625637, 6815821, 2286867, 6274586, 17853756, 55948157, 6995898, 44126015, 66643915, 41338910, 8626219, 67858810, 38597465, 45884338, 565018, 46436141, 15259622, 6594706, 39479497, 5535388, 5855546, 48734782, 2896555, 67296211, 713979, 33110251, 8987918, 1224687, 5637315, 484473, 9814600, 29694710, 60902260, 25897153, 40705483, 1439301, 3055155, 26319992, 6245002, 66441896, 46427698, 36330836, 8915199, 46205024, 62459417, 3497439, 54888931, 30475522, 38998249, 12636103, 60536957)
lc_B <- c(lc_B, 27521279, 2365984, 361549, 43430210, 35843833, 9768308, 12705933, 59179388, 60830121, 67929084, 36138408, 854552, 8865548, 13096420, 23836169, 61502149, 1621627, 11426617, 48274995, 41123011, 7296181, 29635336, 30565882, 8145149, 46116481, 21119590, 43894290, 65866235, 44143687, 873468, 12419378, 26378681, 55140334, 56964922, 61682200, 14338072, 65047247, 57267246, 59581503, 41093708, 48524124, 513842, 1685090, 42723216, 60647576, 55341080, 9735578, 41110083, 30255415, 56010965, 63214550, 67828966, 671468, 38540004, 65107371, 18645038, 26017706, 660734, 573283, 9454644, 64017354, 617449, 7645594, 43286428, 55941273, 8636865, 31226902, 46194753, 6160505, 1412225, 65741544, 24084859, 58532795, 41880754, 45515321, 60585561, 65272380, 7937327, 1489732, 17553239, 7638498, 1473206, 38162164, 3355990, 15610681, 57025137, 6254978, 38162571, 52768311, 5938741, 58101279, 18895673, 30175739, 38222417, 55909312, 65663878, 6607837, 24725076, 61722475, 11895058, 28182084, 185962, 55259655, 16241080, 66602227, 5781939, 60801476, 6996130, 12346893, 65672013, 19076244, 1475379, 9056893, 59492895, 56864322, 60942704, 44015940, 62225220, 39739191, 66435524, 44199929, 59471139, 38547168, 6205030, 38615829, 6698930, 66514563, 1623685, 60545969, 46703319, 39739315, 12636426, 65364691, 16403147, 9204637, 19306532, 66270322, 65653692, 22313524, 59082682, 19796545, 10766253, 50436003, 49363132, 27600713, 44865530, 57763719, 47857115, 48535477, 65986020, 58603818, 42934257, 1167844, 66390187, 58281312, 63888770, 48596526, 67385135, 24775459, 55090096, 12347068, 37317537, 64007908, 1683908, 11976597, 41019342, 6855113, 7964638, 65701227, 44037648, 23133074, 9787718, 61389384, 38418035, 33130454, 13038119, 14639242, 38505864, 65725266, 62904623, 68513661, 36039498, 6538734, 51857455, 59139740, 64341225, 21430833, 55455899, 17795459, 65128493, 46428798, 43216120, 59199242, 50364311, 41079485, 27711293, 63218354, 65492649, 50819365, 40737432, 377507, 65736437, 61488876, 
44886450, 31467727, 46651816, 11914779, 65352381, 24726593, 52989922, 43105128, 34322310, 8669148, 12795739, 38485516, 39559934, 4280915, 63437401, 7103037, 44946049, 15400322, 28583975, 59592185, 877645, 56019484, 3372858, 60556772, 19846532, 11658194, 6894823, 61414862, 52708301, 48806212, 12204849, 60863986, 3919883, 37661631, 47210580, 14689912, 23393084, 60961679, 6170889, 55191727, 14690280, 42415518, 65855022, 62156039, 38536464, 44603544, 63527328, 48182146, 25867085, 61952845, 4744682, 20110370, 65854766, 57722242, 11438361, 34111919, 53262232, 12247443, 64210396, 37630339, 41237564, 46722148, 65791211, 16882760, 7719304, 37622016, 3220774, 51906280, 12446784, 50064210, 57733299, 63437152, 38445791, 3730324, 56052115, 57354312, 58010576, 626701, 7224706, 64079786, 62167132, 8396526, 7625377, 12707224, 35084508, 56022111, 52027979, 43215589, 50425264, 59253209, 28312549, 67376619, 30795837, 43869662, 20849433, 55351366, 39549686, 22972745, 1025579)


# Note: the specific member IDs in lc_A and lc_B are not present in the lendingclub
# dataset, so every row below falls into Group "C"; the downstream calls are
# therefore left commented out.
lendingclub_ab <- lendingclub %>%
    mutate(Group = ifelse(member_id %in% lc_A, "A", ifelse(member_id %in% lc_B, "B", "C")))
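The nested `ifelse()` above can also be written with `dplyr::case_when()`, which reads top-to-bottom and scales better to more than two conditions. A minimal sketch on made-up stand-in data (the tibble `toy` and the ID vectors `ids_A`/`ids_B` are illustrative, not part of the lendingclub data):

```r
library(dplyr)

# Toy stand-in for lendingclub (member_id values are made up)
toy <- tibble(member_id = c(1, 2, 3, 4),
              loan_amnt = c(5000, 12000, 8000, 20000))
ids_A <- c(1, 4)
ids_B <- c(2)

# case_when() evaluates conditions in order; TRUE acts as the fall-through
toy_ab <- toy %>%
    mutate(Group = case_when(
        member_id %in% ids_A ~ "A",
        member_id %in% ids_B ~ "B",
        TRUE                 ~ "C"
    ))
```

With these toy inputs, `toy_ab$Group` comes out as `"A" "B" "C" "A"`.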


# ggplot(lendingclub_ab, aes(x=Group, y=loan_amnt)) + geom_boxplot()

# conduct a two-sided t-test (t.test() requires Group to have exactly two levels)
# t.test(loan_amnt ~ Group, data=lendingclub_ab)


# build lendingclub_multi: a multiple regression adjusting for grade and
# verification status
# lendingclub_multi <- lm(loan_amnt ~ Group + grade + verification_status, data=lendingclub_ab)

# examine lendingclub_multi results as a tidy data frame
# broom::tidy(lendingclub_multi)
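Because the real lc_A/lc_B member IDs do not appear in lendingclub, the calls above cannot run as written. The same workflow can be sketched on simulated two-group data (the group means and standard deviation below are arbitrary assumptions chosen for illustration):

```r
# Simulate loan amounts for two groups of 100 members each
set.seed(42)
sim <- data.frame(
    Group     = rep(c("A", "B"), each = 100),
    loan_amnt = c(rnorm(100, mean = 15000, sd = 4000),
                  rnorm(100, mean = 16000, sd = 4000))
)

# Two-sided Welch t-test; the formula interface needs exactly two groups
tt <- t.test(loan_amnt ~ Group, data = sim)

# broom::tidy() converts the htest object into a one-row data frame,
# with estimate, statistic, p.value, and confidence-interval columns
tidy_tt <- broom::tidy(tt)
tidy_tt
```

The tidy output slots directly into further dplyr/ggplot2 pipelines, which is the usual reason to prefer it over printing the raw `htest` object.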

Chapter 3 - Randomized Complete (and Balanced Incomplete) Block Designs